Related
Is it possible for the same exact Mongo ObjectId to be generated for a document in two different collections? I realize that it's definitely very unlikely, but is it possible?
Without getting too specific, the reason I ask is that with an application that I'm working on we show public profiles of elected officials who we hope to convert into full fledged users of our site. We have separate collections for users and the elected officials who aren't currently members of our site. There are various other documents containing various pieces of data about the elected officials that all map back to the person using their elected official ObjectId.
After creating the account we still highlight the data that's associated to the elected official but they now also are a part of the users collection with a corresponding users ObjectId to map their profile to interactions with our application.
We had begun converting our application from MySql to Mongo a few months ago and while we're in transition we store the legacy MySql id for both of these data types and we're also starting to now store the elected official Mongo ObjectId in the users document to map back to the elected official data.
I was pondering just specifying the new user ObjectId as the previous elected official ObjectId to make things simpler but wanted to make sure that it wasn't possible to have a collision with any existing user ObjectId.
Thanks for your insight.
Edit: Shortly after posting this question, I realized that my proposed solution wasn't a very good idea. It would be better to just keep the current schema that we have in place and just link to the elected official '_id' in the users document.
Short Answer
Just to add a direct response to your initial question: YES, if you use BSON Object ID generation, then for most drivers the IDs are almost certainly going to be unique across collections. See below for what "almost certainly" means.
Long Answer
The BSON Object ID's generated by Mongo DB drivers are highly likely to be unique across collections. This is mainly because of the last 3 bytes of the ID, which for most drivers is generated via a static incrementing counter. That counter is collection-independent; it's global. The Java driver, for example, uses a randomly initialized, static AtomicInteger.
So why, in the Mongo docs, do they say that the IDs are "highly likely" to be unique, instead of outright saying that they WILL be unique? Three possibilities can occur where you won't get a unique ID (please let me know if there are more):
Before this discussion, recall that the BSON Object ID consists of:
[4 bytes seconds since epoch, 3 bytes machine hash, 2 bytes process ID, 3 bytes counter]
Here are the three possibilities, so you judge for yourself how likely it is to get a dupe:
1) Counter overflow: there are 3 bytes in the counter. If you happen to insert over 16,777,216 (2^24) documents in a single second, on the same machine, in the same process, then you may overflow the incrementing counter bytes and end up with two Object IDs that share the same time, machine, process, and counter values.
2) Counter non-incrementing: some Mongo drivers use random numbers instead of incrementing numbers for the counter bytes. In these cases, there is a 1/16,777,216 chance of generating a non-unique ID, but only if those two IDs are generated in the same second (i.e. before the time section of the ID updates to the next second), on the same machine, in the same process.
3) Machine and process hash to the same values. The machine ID and process ID values may, in some highly unlikely scenario, map to the same values for two different machines. If this occurs, and at the same time the two counters on the two different machines, during the same second, generate the same value, then you'll end up with a duplicate ID.
These are the three scenarios to watch out for. Scenario 1 and 3 seem highly unlikely, and scenario 2 is totally avoidable if you're using the right driver. You'll have to check the source of the driver to know for sure.
ObjectIds are generated client-side in a manner similar to UUID but with some nicer properties for storage in a database such as roughly increasing order and encoding their creation time for free. The key thing for your use case is that they are designed to guarantee uniqueness to a high probability even if they are generated on different machines.
Now if you were referring to the _id field in general, we do not require uniqueness across collections so it is safe to reuse the old _id. As a concrete example, if you have two collections, colors and fruits, both could simultaneously have an object like {_id: 'orange'}.
In case you want to know more about how ObjectIds are created, here is the spec: http://www.mongodb.org/display/DOCS/Object+IDs#ObjectIDs-BSONObjectIDSpecification
In case anyone is having problems with duplicate Mongo ObjectIDs, you should know that despite the unlikelihood of dups happening in Mongo itself, it is possible to have duplicate _id's generated with PHP in Mongo.
The use-case where this has happened with regularity for me is when I'm looping through a dataset and attempting to inject the data into a collection.
The array that holds the injection data must be explicitly reset on each iteration - even if you aren't specifying the _id value. For some reason, the INSERT process adds the Mongo _id to the array as if it were a global variable (even if the array doesn't have global scope). This can affect you even if you are calling the insertion in a separate function call where you would normally expect the values of the array not to persist back to the calling function.
There are three solutions to this:
You can unset() the _id field from the array
You can reinitialize the entire array with array() each time you loop through your dataset
You can explicitly define the _id value yourself (taking care to define it in such a way that you don't generate dups yourself).
My guess is that this is a bug in the PHP interface, and not so much an issue with Mongo, but if you run into this problem, just unset the _id and you should be fine.
There's no guarantee whatsoever about ObjectId uniqueness across collections. Even if it's probabilistically very unlikely, it would be a very poor application design that relied on _id uniqueness across collections.
One can easily test this in the mongo shell:
MongoDB shell version: 1.6.5
connecting to: test
> db.foo.insert({_id: 'abc'})
> db.bar.insert({_id: 'abc'})
> db.foo.find({_id: 'abc'})
{ "_id" : "abc" }
> db.bar.find({_id: 'abc'})
{ "_id" : "abc" }
> db.foo.insert({_id: 'abc', data:'xyz'})
E11000 duplicate key error index: test.foo.$_id_ dup key: { : "abc" }
So, absolutely don't rely on _id's being unique across collections, and since you don't control the ObjectId generation function, don't rely on it.
It's possible to create something that's more like a uuid, and if you do that manually, you could have some better guarantee of uniqueness.
Remember that you can put objects of different "types" in the same collection, so why not just put your two "tables" in the same collection. They would share the same _id space, and thus, would be guaranteed unique. Switching from "prospective" to "registered" would be a simple flipping of a field...
In my app I'm letting mongo generate order id's via its ObjectId method.
But in user testing we've had some concerns that the order id's are humanly 'intimidating', i.e. if you need to discuss your order with someone over the telephone, reading out 24 alphanumeric characters is a bit tedious.
At the same time, I don't really want to have to store two different id's, one 'human-accessible' and one used by mongo internally.
So my question is this - is there a way to choose a substring of length 6 or even 8 of the mongo objectId string that I could be fairly sure would be unique ?
For example if I have a mongo objectid like this
id = '4b28dcb61083ed3c809e0416'
maybe I could take out
human_id = id.substr(0,7);
and be sure that i'd always get unique id's for my orders...
The advantage of course is that these are orders, and so are human-created, and so there aren't millions of them per millisecond. On the other hand, it would really be a problem if two orders had the same shortened id...
--- clearer explanation ---
I guess a better way to ask my question would be this :
If I decide for example to just use the last 6 characters of a mongo id, is there some kind of measure of 'probability' that just these 6 characters would repeat in a given week ?
Given a certain number of mongo's running in parallel, a certain number of users during the week, etc.
If you have multiple web servers, with multiple processes, then there really isn't something you can remove with losing uniqueness.
If you look at the nature of the ObjectId:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
You'll see there's not much there that you could safely remove. As the first 4 bytes are time, it would be challenging to implement an algorithm that removed portions of the time stamp in a clean and safe way.
The machine identifier and process identifier are used in cases where there are multiple servers and/or processes acting as clients to the database server. If you dropped either of those, you could end up with duplicates again. The random value as the last 3 bytes is used to make sure that two identifiers, on the same machine, within the same process are unique, even when requested frequently.
If you were using it as an order id, and you want assured uniqueness, I wouldn't trim anything away from the 12 byte number as it was carefully designed to provide a robust and efficient distributed mechanism for generating unique numbers when there are many connected database clients.
If you took the last 5 characters of the ObjectId ..., and in a given period, what's the probability of conflict?
process id
counter
The probability of conflict is high. The process id may remain the same through the entire period, and the other number is just an incrementing number that would repeat after 4095 orders. But, if the process recycles, then you also have the chance that there will be a conflict with older orders, etc. And if you're talking multiple database clients, the chances increase as well. I just wouldn't try to trim the number. It's not worth the unhappy customers trying to place orders.
Even the timestamp and the random seed value aren't sufficient when there are multiple database clients generating ObjectIds. As you start to look at the various pieces, especially in the context of a farm of database clients, you should see why the pieces are there, and why removing them could lead to a meltdown in ObjectId generation.
I'd suggest you implement an algorithm to create a unique number and store it in the database. It's simple enough to do. It does impact performance a bit, but it's safe.
I wrote this answer a while ago about the challenges of using an ObjectId in a Url. It includes a link to how to create a unique auto incrementing number using MongoDB.
Actually what you choose for and Id (actually _id in MongoDB storage) is totally up to you. If there is some useful data you can keep in _id as long as you keep it unique, then do so. If it has to be something valid to url encoding, then do so.
By default, if you do not specify an _id then that field will be populated with the value you have come to love and hate. But if you explicitly use it, then you will get what you want.
The extra thing to keep in mind is that even if you specify an addtional unique index field, let's say order_id then MongoDB would actually have to check through that and other indexes on a query plan to see which one was best to use. But if _id was your key, the plan would give up and go strait for the 'Primary Key', and this is going to be a lot faster.
So make your own Id just as long as you can ensure it will be unique.
Is it possible for the same exact Mongo ObjectId to be generated for a document in two different collections? I realize that it's definitely very unlikely, but is it possible?
Without getting too specific, the reason I ask is that with an application that I'm working on we show public profiles of elected officials who we hope to convert into full fledged users of our site. We have separate collections for users and the elected officials who aren't currently members of our site. There are various other documents containing various pieces of data about the elected officials that all map back to the person using their elected official ObjectId.
After creating the account we still highlight the data that's associated to the elected official but they now also are a part of the users collection with a corresponding users ObjectId to map their profile to interactions with our application.
We had begun converting our application from MySql to Mongo a few months ago and while we're in transition we store the legacy MySql id for both of these data types and we're also starting to now store the elected official Mongo ObjectId in the users document to map back to the elected official data.
I was pondering just specifying the new user ObjectId as the previous elected official ObjectId to make things simpler but wanted to make sure that it wasn't possible to have a collision with any existing user ObjectId.
Thanks for your insight.
Edit: Shortly after posting this question, I realized that my proposed solution wasn't a very good idea. It would be better to just keep the current schema that we have in place and just link to the elected official '_id' in the users document.
Short Answer
Just to add a direct response to your initial question: YES, if you use BSON Object ID generation, then for most drivers the IDs are almost certainly going to be unique across collections. See below for what "almost certainly" means.
Long Answer
The BSON Object ID's generated by Mongo DB drivers are highly likely to be unique across collections. This is mainly because of the last 3 bytes of the ID, which for most drivers is generated via a static incrementing counter. That counter is collection-independent; it's global. The Java driver, for example, uses a randomly initialized, static AtomicInteger.
So why, in the Mongo docs, do they say that the IDs are "highly likely" to be unique, instead of outright saying that they WILL be unique? Three possibilities can occur where you won't get a unique ID (please let me know if there are more):
Before this discussion, recall that the BSON Object ID consists of:
[4 bytes seconds since epoch, 3 bytes machine hash, 2 bytes process ID, 3 bytes counter]
Here are the three possibilities, so you judge for yourself how likely it is to get a dupe:
1) Counter overflow: there are 3 bytes in the counter. If you happen to insert over 16,777,216 (2^24) documents in a single second, on the same machine, in the same process, then you may overflow the incrementing counter bytes and end up with two Object IDs that share the same time, machine, process, and counter values.
2) Counter non-incrementing: some Mongo drivers use random numbers instead of incrementing numbers for the counter bytes. In these cases, there is a 1/16,777,216 chance of generating a non-unique ID, but only if those two IDs are generated in the same second (i.e. before the time section of the ID updates to the next second), on the same machine, in the same process.
3) Machine and process hash to the same values. The machine ID and process ID values may, in some highly unlikely scenario, map to the same values for two different machines. If this occurs, and at the same time the two counters on the two different machines, during the same second, generate the same value, then you'll end up with a duplicate ID.
These are the three scenarios to watch out for. Scenario 1 and 3 seem highly unlikely, and scenario 2 is totally avoidable if you're using the right driver. You'll have to check the source of the driver to know for sure.
ObjectIds are generated client-side in a manner similar to UUID but with some nicer properties for storage in a database such as roughly increasing order and encoding their creation time for free. The key thing for your use case is that they are designed to guarantee uniqueness to a high probability even if they are generated on different machines.
Now if you were referring to the _id field in general, we do not require uniqueness across collections so it is safe to reuse the old _id. As a concrete example, if you have two collections, colors and fruits, both could simultaneously have an object like {_id: 'orange'}.
In case you want to know more about how ObjectIds are created, here is the spec: http://www.mongodb.org/display/DOCS/Object+IDs#ObjectIDs-BSONObjectIDSpecification
In case anyone is having problems with duplicate Mongo ObjectIDs, you should know that despite the unlikelihood of dups happening in Mongo itself, it is possible to have duplicate _id's generated with PHP in Mongo.
The use-case where this has happened with regularity for me is when I'm looping through a dataset and attempting to inject the data into a collection.
The array that holds the injection data must be explicitly reset on each iteration - even if you aren't specifying the _id value. For some reason, the INSERT process adds the Mongo _id to the array as if it were a global variable (even if the array doesn't have global scope). This can affect you even if you are calling the insertion in a separate function call where you would normally expect the values of the array not to persist back to the calling function.
There are three solutions to this:
You can unset() the _id field from the array
You can reinitialize the entire array with array() each time you loop through your dataset
You can explicitly define the _id value yourself (taking care to define it in such a way that you don't generate dups yourself).
My guess is that this is a bug in the PHP interface, and not so much an issue with Mongo, but if you run into this problem, just unset the _id and you should be fine.
There's no guarantee whatsoever about ObjectId uniqueness across collections. Even if it's probabilistically very unlikely, it would be a very poor application design that relied on _id uniqueness across collections.
One can easily test this in the mongo shell:
MongoDB shell version: 1.6.5
connecting to: test
> db.foo.insert({_id: 'abc'})
> db.bar.insert({_id: 'abc'})
> db.foo.find({_id: 'abc'})
{ "_id" : "abc" }
> db.bar.find({_id: 'abc'})
{ "_id" : "abc" }
> db.foo.insert({_id: 'abc', data:'xyz'})
E11000 duplicate key error index: test.foo.$_id_ dup key: { : "abc" }
So, absolutely don't rely on _id's being unique across collections, and since you don't control the ObjectId generation function, don't rely on it.
It's possible to create something that's more like a uuid, and if you do that manually, you could have some better guarantee of uniqueness.
Remember that you can put objects of different "types" in the same collection, so why not just put your two "tables" in the same collection. They would share the same _id space, and thus, would be guaranteed unique. Switching from "prospective" to "registered" would be a simple flipping of a field...
I'm building a very large counter system. To be clear, the system is counting the number of times a domain occurs in a stream of data (that's about 50 - 100 million elements in size).
The system will individually process each element and make a database request to increment a counter for that domain and the date it is processed on. Here's the structure:
stats_table (or collection)
-----------
id
domain (string)
date (date, YYYY-MM-DD)
count (integer)
My initial inkling was to use MongoDB because of their atomic counter feature. However as I thought about it more, I figured Postgres updates already occur atomically (at least that's what this question leads me to believe).
My question is this: is there any benefit of using one database over the other here? Assuming that I'll be processing around 5 million domains a day, what are the key things I need to be considering here?
All single operations in Postgres are automatically wrapped in transactions and all operations on a single document in MongoDB are atomic. Atomicity isn't really a reason to preference one database over the other in this case.
While the individual counts may get quite high, if you're only storing aggregate counts and not each instance of a count, the total number of records should not be too significant. Even if you're tracking millions of domains, either Mongo or Postgres will work equally well.
MongoDB is a good solution for logging events, but I find Postgres to be preferable if you want to do a lot of interesting, relational analysis on the analytics data you're collecting. To do so efficiently in Mongo often requires a high degree of denormalization, so I'd think more about how you plan to use the data in the future.
Is it possible for the same exact Mongo ObjectId to be generated for a document in two different collections? I realize that it's definitely very unlikely, but is it possible?
Without getting too specific, the reason I ask is that with an application that I'm working on we show public profiles of elected officials who we hope to convert into full fledged users of our site. We have separate collections for users and the elected officials who aren't currently members of our site. There are various other documents containing various pieces of data about the elected officials that all map back to the person using their elected official ObjectId.
After creating the account we still highlight the data that's associated to the elected official but they now also are a part of the users collection with a corresponding users ObjectId to map their profile to interactions with our application.
We had begun converting our application from MySql to Mongo a few months ago and while we're in transition we store the legacy MySql id for both of these data types and we're also starting to now store the elected official Mongo ObjectId in the users document to map back to the elected official data.
I was pondering just specifying the new user ObjectId as the previous elected official ObjectId to make things simpler but wanted to make sure that it wasn't possible to have a collision with any existing user ObjectId.
Thanks for your insight.
Edit: Shortly after posting this question, I realized that my proposed solution wasn't a very good idea. It would be better to just keep the current schema that we have in place and just link to the elected official '_id' in the users document.
Short Answer
Just to add a direct response to your initial question: YES, if you use BSON Object ID generation, then for most drivers the IDs are almost certainly going to be unique across collections. See below for what "almost certainly" means.
Long Answer
The BSON Object ID's generated by Mongo DB drivers are highly likely to be unique across collections. This is mainly because of the last 3 bytes of the ID, which for most drivers is generated via a static incrementing counter. That counter is collection-independent; it's global. The Java driver, for example, uses a randomly initialized, static AtomicInteger.
So why, in the Mongo docs, do they say that the IDs are "highly likely" to be unique, instead of outright saying that they WILL be unique? Three possibilities can occur where you won't get a unique ID (please let me know if there are more):
Before this discussion, recall that the BSON Object ID consists of:
[4 bytes seconds since epoch, 3 bytes machine hash, 2 bytes process ID, 3 bytes counter]
Here are the three possibilities, so you judge for yourself how likely it is to get a dupe:
1) Counter overflow: there are 3 bytes in the counter. If you happen to insert over 16,777,216 (2^24) documents in a single second, on the same machine, in the same process, then you may overflow the incrementing counter bytes and end up with two Object IDs that share the same time, machine, process, and counter values.
2) Counter non-incrementing: some Mongo drivers use random numbers instead of incrementing numbers for the counter bytes. In these cases, there is a 1/16,777,216 chance of generating a non-unique ID, but only if those two IDs are generated in the same second (i.e. before the time section of the ID updates to the next second), on the same machine, in the same process.
3) Machine and process hash to the same values. The machine ID and process ID values may, in some highly unlikely scenario, map to the same values for two different machines. If this occurs, and at the same time the two counters on the two different machines, during the same second, generate the same value, then you'll end up with a duplicate ID.
These are the three scenarios to watch out for. Scenario 1 and 3 seem highly unlikely, and scenario 2 is totally avoidable if you're using the right driver. You'll have to check the source of the driver to know for sure.
ObjectIds are generated client-side in a manner similar to UUID but with some nicer properties for storage in a database such as roughly increasing order and encoding their creation time for free. The key thing for your use case is that they are designed to guarantee uniqueness to a high probability even if they are generated on different machines.
Now if you were referring to the _id field in general, we do not require uniqueness across collections so it is safe to reuse the old _id. As a concrete example, if you have two collections, colors and fruits, both could simultaneously have an object like {_id: 'orange'}.
In case you want to know more about how ObjectIds are created, here is the spec: http://www.mongodb.org/display/DOCS/Object+IDs#ObjectIDs-BSONObjectIDSpecification
In case anyone is having problems with duplicate Mongo ObjectIDs, you should know that despite the unlikelihood of dups happening in Mongo itself, it is possible to have duplicate _id's generated with PHP in Mongo.
The use-case where this has happened with regularity for me is when I'm looping through a dataset and attempting to inject the data into a collection.
The array that holds the injection data must be explicitly reset on each iteration - even if you aren't specifying the _id value. For some reason, the INSERT process adds the Mongo _id to the array as if it were a global variable (even if the array doesn't have global scope). This can affect you even if you are calling the insertion in a separate function call where you would normally expect the values of the array not to persist back to the calling function.
There are three solutions to this:
You can unset() the _id field from the array
You can reinitialize the entire array with array() each time you loop through your dataset
You can explicitly define the _id value yourself (taking care to define it in such a way that you don't generate dups yourself).
My guess is that this is a bug in the PHP interface, and not so much an issue with Mongo, but if you run into this problem, just unset the _id and you should be fine.
There's no guarantee whatsoever about ObjectId uniqueness across collections. Even if it's probabilistically very unlikely, it would be a very poor application design that relied on _id uniqueness across collections.
One can easily test this in the mongo shell:
MongoDB shell version: 1.6.5
connecting to: test
> db.foo.insert({_id: 'abc'})
> db.bar.insert({_id: 'abc'})
> db.foo.find({_id: 'abc'})
{ "_id" : "abc" }
> db.bar.find({_id: 'abc'})
{ "_id" : "abc" }
> db.foo.insert({_id: 'abc', data:'xyz'})
E11000 duplicate key error index: test.foo.$_id_ dup key: { : "abc" }
So, absolutely don't rely on _id's being unique across collections, and since you don't control the ObjectId generation function, don't rely on it.
It's possible to create something that's more like a uuid, and if you do that manually, you could have some better guarantee of uniqueness.
Remember that you can put objects of different "types" in the same collection, so why not just put your two "tables" in the same collection. They would share the same _id space, and thus, would be guaranteed unique. Switching from "prospective" to "registered" would be a simple flipping of a field...