When inserting a document into a MongoDB and no _id is assigned explicitly, one is assigned automatically. My question is, where is this assignment made and where is the ID generated? Is it generated by the client, before sending the insert request, or is it on the server side?
The context of my question is that I want to use MongoDB to build an "event store" (in the sense of "event sourcing"). Part of that is that the store enforces an ordering on the events. There is already an internal ordering in MongoDB, which is sufficient. However, at some point I may have to resume reading events at some point. For that, I can't use the last processed ID in the filter expression, because the process-unique and random parts of the ID don't guarantee any ordering. However, the timestamp part of the ID could do that, if only it was guaranteed to rise monotonically.
If the ID is generated by the server (and that server doesn't do anything funny like having a backward jumping system clock), then there is only one place that generates these IDs. Differences in system clocks and latency between different systems becomes irrelevant. Subsequently, I can rely on the timestamp part of the ID increasing monotonically.
All clients (e.g., Mongo Shell, a Python, Java or NodeJS application) connects to the MongoDB database server via a driver software. If a value is not supplied by the application to the _id field the driver generates one. (NOTE: I believe, not sure, if the driver fails to assign one, the database server assigns the value for the _id field). The default _id field value is of BSON type ObjectId.
(1) According to MongoDB docs the ObjectId is generated by clients:
IMPORTANT
While ObjectId values should increase over time, they are not
necessarily monotonic. This is because they:
Only contain one second of temporal resolution, so ObjectId values created within the same second do not have a guaranteed ordering, and
Are generated by clients, which may have differing system clocks.
(2) So, how to ensure your ObjectId is unique?
See this MongoDB blog article Generating Globally Unique Identifiers for Use with MongoDB topics:
Ensure identifier uniqueness at the database level
Use ObjectID as a unique identifier
Related
Is it possible for the same exact Mongo ObjectId to be generated for a document in two different collections? I realize that it's definitely very unlikely, but is it possible?
Without getting too specific, the reason I ask is that with an application that I'm working on we show public profiles of elected officials who we hope to convert into full fledged users of our site. We have separate collections for users and the elected officials who aren't currently members of our site. There are various other documents containing various pieces of data about the elected officials that all map back to the person using their elected official ObjectId.
After creating the account we still highlight the data that's associated to the elected official but they now also are a part of the users collection with a corresponding users ObjectId to map their profile to interactions with our application.
We had begun converting our application from MySql to Mongo a few months ago and while we're in transition we store the legacy MySql id for both of these data types and we're also starting to now store the elected official Mongo ObjectId in the users document to map back to the elected official data.
I was pondering just specifying the new user ObjectId as the previous elected official ObjectId to make things simpler but wanted to make sure that it wasn't possible to have a collision with any existing user ObjectId.
Thanks for your insight.
Edit: Shortly after posting this question, I realized that my proposed solution wasn't a very good idea. It would be better to just keep the current schema that we have in place and just link to the elected official '_id' in the users document.
Short Answer
Just to add a direct response to your initial question: YES, if you use BSON Object ID generation, then for most drivers the IDs are almost certainly going to be unique across collections. See below for what "almost certainly" means.
Long Answer
The BSON Object ID's generated by Mongo DB drivers are highly likely to be unique across collections. This is mainly because of the last 3 bytes of the ID, which for most drivers is generated via a static incrementing counter. That counter is collection-independent; it's global. The Java driver, for example, uses a randomly initialized, static AtomicInteger.
So why, in the Mongo docs, do they say that the IDs are "highly likely" to be unique, instead of outright saying that they WILL be unique? Three possibilities can occur where you won't get a unique ID (please let me know if there are more):
Before this discussion, recall that the BSON Object ID consists of:
[4 bytes seconds since epoch, 3 bytes machine hash, 2 bytes process ID, 3 bytes counter]
Here are the three possibilities, so you judge for yourself how likely it is to get a dupe:
1) Counter overflow: there are 3 bytes in the counter. If you happen to insert over 16,777,216 (2^24) documents in a single second, on the same machine, in the same process, then you may overflow the incrementing counter bytes and end up with two Object IDs that share the same time, machine, process, and counter values.
2) Counter non-incrementing: some Mongo drivers use random numbers instead of incrementing numbers for the counter bytes. In these cases, there is a 1/16,777,216 chance of generating a non-unique ID, but only if those two IDs are generated in the same second (i.e. before the time section of the ID updates to the next second), on the same machine, in the same process.
3) Machine and process hash to the same values. The machine ID and process ID values may, in some highly unlikely scenario, map to the same values for two different machines. If this occurs, and at the same time the two counters on the two different machines, during the same second, generate the same value, then you'll end up with a duplicate ID.
These are the three scenarios to watch out for. Scenario 1 and 3 seem highly unlikely, and scenario 2 is totally avoidable if you're using the right driver. You'll have to check the source of the driver to know for sure.
ObjectIds are generated client-side in a manner similar to UUID but with some nicer properties for storage in a database such as roughly increasing order and encoding their creation time for free. The key thing for your use case is that they are designed to guarantee uniqueness to a high probability even if they are generated on different machines.
Now if you were referring to the _id field in general, we do not require uniqueness across collections so it is safe to reuse the old _id. As a concrete example, if you have two collections, colors and fruits, both could simultaneously have an object like {_id: 'orange'}.
In case you want to know more about how ObjectIds are created, here is the spec: http://www.mongodb.org/display/DOCS/Object+IDs#ObjectIDs-BSONObjectIDSpecification
In case anyone is having problems with duplicate Mongo ObjectIDs, you should know that despite the unlikelihood of dups happening in Mongo itself, it is possible to have duplicate _id's generated with PHP in Mongo.
The use-case where this has happened with regularity for me is when I'm looping through a dataset and attempting to inject the data into a collection.
The array that holds the injection data must be explicitly reset on each iteration - even if you aren't specifying the _id value. For some reason, the INSERT process adds the Mongo _id to the array as if it were a global variable (even if the array doesn't have global scope). This can affect you even if you are calling the insertion in a separate function call where you would normally expect the values of the array not to persist back to the calling function.
There are three solutions to this:
You can unset() the _id field from the array
You can reinitialize the entire array with array() each time you loop through your dataset
You can explicitly define the _id value yourself (taking care to define it in such a way that you don't generate dups yourself).
My guess is that this is a bug in the PHP interface, and not so much an issue with Mongo, but if you run into this problem, just unset the _id and you should be fine.
There's no guarantee whatsoever about ObjectId uniqueness across collections. Even if it's probabilistically very unlikely, it would be a very poor application design that relied on _id uniqueness across collections.
One can easily test this in the mongo shell:
MongoDB shell version: 1.6.5
connecting to: test
> db.foo.insert({_id: 'abc'})
> db.bar.insert({_id: 'abc'})
> db.foo.find({_id: 'abc'})
{ "_id" : "abc" }
> db.bar.find({_id: 'abc'})
{ "_id" : "abc" }
> db.foo.insert({_id: 'abc', data:'xyz'})
E11000 duplicate key error index: test.foo.$_id_ dup key: { : "abc" }
So, absolutely don't rely on _id's being unique across collections, and since you don't control the ObjectId generation function, don't rely on it.
It's possible to create something that's more like a uuid, and if you do that manually, you could have some better guarantee of uniqueness.
Remember that you can put objects of different "types" in the same collection, so why not just put your two "tables" in the same collection. They would share the same _id space, and thus, would be guaranteed unique. Switching from "prospective" to "registered" would be a simple flipping of a field...
I would like to know if publishing _id of a document is safe.
I am using an analytics software to track behaviors of users, and I need to access _id on client for better context. However, it vexes me that I am publishing an internal information of a document.
All in all, being new to mongo and Meteor, I would like to make sure if this is safe. Any suggestions?
If the document's _id is created using Meteor the document's _id is at best fully random.
It doesn't contain any information besides a reference to the document itself.
If you're publishing the document this should no reveal any further information.
Even when Meteor uses an ObjectID (Meteor generated) the timestamp and other identifiers are random too. The timestamp component is also random, as mentioned in the Meteor docs (http://docs.meteor.com/#/full/mongo_object_id)
If an _id is used that is an ObjectId generated by MongoDB externally, outside of Meteor contains information such as a timestamp and details about your server. But this should not be an issue if your app is a typical Meteor app.
If you're using the default ObjectId implementation, here's what you'd be exposing:
ObjectId is a 12-byte BSON type, constructed using:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
There are some slight concerns here:
Knowing the time, machine ID, process ID and a previous counter lets an attacker plausibly guess other ObjectIds which might be used to obtain other rows from your database (if you enable any form of access via the ID, which seems possible given that you want the client to know the _id).
Knowing the machine identifier might allow an attacker to identify particular database servers (only relevant if they already have access to your network)
Knowing the PID might allow an attacker to figure out the uptime of your DB server (watching for when the PID changes)
Knowing the time gives the attacker the creation time of the document (only valuable if the attacker didn't already know that)
These are all fairly minor leaks in the grand scheme of things, although I usually lean towards avoiding unnecessary information leakage where possible. Consider using a non-default _id column (e.g. using a randomly-generated value), and whether you really need the plaintext _id column to be visible (maybe you can use a cryptographic hash of it instead, or perhaps you can encrypt it so that the client only sees an encrypted version).
It's quite safe. As other answers correctly stated, yes, it may contain some bits of information about your machine and server process (depending on where the id was generated, in meteor or mongodb server)
But to leverage them for any malicious actions is a tough task. An attacker pretty much needs to get inside your network to do anything. And if he's in your network, you're screwed anyway, regardless of what ID format you expose.
By any measure, ObjectId is more resilient to guesses than a regular auto-incrementing integer ids you find in other databases. And they're exposed all the time!
Note, though, that you shouldn't rely on the fact that an id is hard to guess. Protect private pages with authorization checks, etc. So that even if an attacker correctly guesses an id of a page he shouldn't have access to, he is refused the access.
Exposing document reference(_id is not only data , it is an reference, primary key) is not at all safe.
_id is 12 bytes which consist of
4 byte - timestamp
3 byte - machine id
2 byte - process id
3 byte - increment counter
There are possible way to find out all the details about document and collection of the database using _id.
Note: Exposing _id is not at all safe.
Scenario
Suppose your are using _id for some socket listener or emitter, Anyone can guess other _id using that particular exposed _id and emit some malicious data.
When you expose _id, It contains timestamp, so you are exposing the time of creation of that particular data.
Is it possible for the same exact Mongo ObjectId to be generated for a document in two different collections? I realize that it's definitely very unlikely, but is it possible?
Without getting too specific, the reason I ask is that with an application that I'm working on we show public profiles of elected officials who we hope to convert into full fledged users of our site. We have separate collections for users and the elected officials who aren't currently members of our site. There are various other documents containing various pieces of data about the elected officials that all map back to the person using their elected official ObjectId.
After creating the account we still highlight the data that's associated to the elected official but they now also are a part of the users collection with a corresponding users ObjectId to map their profile to interactions with our application.
We had begun converting our application from MySql to Mongo a few months ago and while we're in transition we store the legacy MySql id for both of these data types and we're also starting to now store the elected official Mongo ObjectId in the users document to map back to the elected official data.
I was pondering just specifying the new user ObjectId as the previous elected official ObjectId to make things simpler but wanted to make sure that it wasn't possible to have a collision with any existing user ObjectId.
Thanks for your insight.
Edit: Shortly after posting this question, I realized that my proposed solution wasn't a very good idea. It would be better to just keep the current schema that we have in place and just link to the elected official '_id' in the users document.
Short Answer
Just to add a direct response to your initial question: YES, if you use BSON Object ID generation, then for most drivers the IDs are almost certainly going to be unique across collections. See below for what "almost certainly" means.
Long Answer
The BSON Object ID's generated by Mongo DB drivers are highly likely to be unique across collections. This is mainly because of the last 3 bytes of the ID, which for most drivers is generated via a static incrementing counter. That counter is collection-independent; it's global. The Java driver, for example, uses a randomly initialized, static AtomicInteger.
So why, in the Mongo docs, do they say that the IDs are "highly likely" to be unique, instead of outright saying that they WILL be unique? Three possibilities can occur where you won't get a unique ID (please let me know if there are more):
Before this discussion, recall that the BSON Object ID consists of:
[4 bytes seconds since epoch, 3 bytes machine hash, 2 bytes process ID, 3 bytes counter]
Here are the three possibilities, so you judge for yourself how likely it is to get a dupe:
1) Counter overflow: there are 3 bytes in the counter. If you happen to insert over 16,777,216 (2^24) documents in a single second, on the same machine, in the same process, then you may overflow the incrementing counter bytes and end up with two Object IDs that share the same time, machine, process, and counter values.
2) Counter non-incrementing: some Mongo drivers use random numbers instead of incrementing numbers for the counter bytes. In these cases, there is a 1/16,777,216 chance of generating a non-unique ID, but only if those two IDs are generated in the same second (i.e. before the time section of the ID updates to the next second), on the same machine, in the same process.
3) Machine and process hash to the same values. The machine ID and process ID values may, in some highly unlikely scenario, map to the same values for two different machines. If this occurs, and at the same time the two counters on the two different machines, during the same second, generate the same value, then you'll end up with a duplicate ID.
These are the three scenarios to watch out for. Scenario 1 and 3 seem highly unlikely, and scenario 2 is totally avoidable if you're using the right driver. You'll have to check the source of the driver to know for sure.
ObjectIds are generated client-side in a manner similar to UUID but with some nicer properties for storage in a database such as roughly increasing order and encoding their creation time for free. The key thing for your use case is that they are designed to guarantee uniqueness to a high probability even if they are generated on different machines.
Now if you were referring to the _id field in general, we do not require uniqueness across collections so it is safe to reuse the old _id. As a concrete example, if you have two collections, colors and fruits, both could simultaneously have an object like {_id: 'orange'}.
In case you want to know more about how ObjectIds are created, here is the spec: http://www.mongodb.org/display/DOCS/Object+IDs#ObjectIDs-BSONObjectIDSpecification
In case anyone is having problems with duplicate Mongo ObjectIDs, you should know that despite the unlikelihood of dups happening in Mongo itself, it is possible to have duplicate _id's generated with PHP in Mongo.
The use-case where this has happened with regularity for me is when I'm looping through a dataset and attempting to inject the data into a collection.
The array that holds the injection data must be explicitly reset on each iteration - even if you aren't specifying the _id value. For some reason, the INSERT process adds the Mongo _id to the array as if it were a global variable (even if the array doesn't have global scope). This can affect you even if you are calling the insertion in a separate function call where you would normally expect the values of the array not to persist back to the calling function.
There are three solutions to this:
You can unset() the _id field from the array
You can reinitialize the entire array with array() each time you loop through your dataset
You can explicitly define the _id value yourself (taking care to define it in such a way that you don't generate dups yourself).
My guess is that this is a bug in the PHP interface, and not so much an issue with Mongo, but if you run into this problem, just unset the _id and you should be fine.
There's no guarantee whatsoever about ObjectId uniqueness across collections. Even if it's probabilistically very unlikely, it would be a very poor application design that relied on _id uniqueness across collections.
One can easily test this in the mongo shell:
MongoDB shell version: 1.6.5
connecting to: test
> db.foo.insert({_id: 'abc'})
> db.bar.insert({_id: 'abc'})
> db.foo.find({_id: 'abc'})
{ "_id" : "abc" }
> db.bar.find({_id: 'abc'})
{ "_id" : "abc" }
> db.foo.insert({_id: 'abc', data:'xyz'})
E11000 duplicate key error index: test.foo.$_id_ dup key: { : "abc" }
So, absolutely don't rely on _id's being unique across collections, and since you don't control the ObjectId generation function, don't rely on it.
It's possible to create something that's more like a uuid, and if you do that manually, you could have some better guarantee of uniqueness.
Remember that you can put objects of different "types" in the same collection, so why not just put your two "tables" in the same collection. They would share the same _id space, and thus, would be guaranteed unique. Switching from "prospective" to "registered" would be a simple flipping of a field...
When storing events in an event store, the order in which the events are stored is very important especially when projecting the events later to restore an entities current state.
MongoDB seems to be a good choice for persisting the event store, given its speed and flexibel schema (and it is often recommended as such) but there is no such thing as a transaction in MongoDB meaning the correct event order can not be garanteed.
Given that fact, should you not use MongoDB if you are looking for a consistent event store but rather stick with a conventional RDMS, or is there a way around this problem?
I'm not familiar with the term "event store" as you are using it, but I can address some of the issues in your question. I believe it is probably reasonable to use MongoDB for what you want, with a little bit of care.
In MongoDB, each document has an _id field which is by default in ObjectId format, which consists of a server identifier, and then a timestamp and then a sequence counter. So you can sort on that field and you'll get your objects in their creation order, provided the ObjectIds are all created on the same machine.
Most MongoDB client drivers create the _id field locally before sending an insert command to the database. So if you have multiple clients connecting to the database, sorting by _id won't do what you want since it will sort first by server-hash, which is not what you want.
But if you can convince your MongoDB client driver to not include the _id in the insert command, then the server will generate the ObjectId for each document and they will have the properties you want. Doing this will depend on what language you're working in since each language has its own client driver. Read the driver docs carefully or dive into their source code -- they're all open source. Most drivers also include a way to send a raw command to the server. So if you construct an insert command by hand this will certainly allow you to do what you want.
This will break down if your system is so massive that a single database server can't handle all of your write traffic. The MongoDB solution to needing to write thousands of records per second is to set up a sharded database. In this case the ObjectIds will again be created by different machines and won't have the nice sorting property you want. If you're concerned about outgrowing a single server for writes, you should look to another technology that provides distributed sequence numbers.
Is it possible for the same exact Mongo ObjectId to be generated for a document in two different collections? I realize that it's definitely very unlikely, but is it possible?
Without getting too specific, the reason I ask is that with an application that I'm working on we show public profiles of elected officials who we hope to convert into full fledged users of our site. We have separate collections for users and the elected officials who aren't currently members of our site. There are various other documents containing various pieces of data about the elected officials that all map back to the person using their elected official ObjectId.
After creating the account we still highlight the data that's associated to the elected official but they now also are a part of the users collection with a corresponding users ObjectId to map their profile to interactions with our application.
We had begun converting our application from MySql to Mongo a few months ago and while we're in transition we store the legacy MySql id for both of these data types and we're also starting to now store the elected official Mongo ObjectId in the users document to map back to the elected official data.
I was pondering just specifying the new user ObjectId as the previous elected official ObjectId to make things simpler but wanted to make sure that it wasn't possible to have a collision with any existing user ObjectId.
Thanks for your insight.
Edit: Shortly after posting this question, I realized that my proposed solution wasn't a very good idea. It would be better to just keep the current schema that we have in place and just link to the elected official '_id' in the users document.
Short Answer
Just to add a direct response to your initial question: YES, if you use BSON Object ID generation, then for most drivers the IDs are almost certainly going to be unique across collections. See below for what "almost certainly" means.
Long Answer
The BSON Object ID's generated by Mongo DB drivers are highly likely to be unique across collections. This is mainly because of the last 3 bytes of the ID, which for most drivers is generated via a static incrementing counter. That counter is collection-independent; it's global. The Java driver, for example, uses a randomly initialized, static AtomicInteger.
So why, in the Mongo docs, do they say that the IDs are "highly likely" to be unique, instead of outright saying that they WILL be unique? Three possibilities can occur where you won't get a unique ID (please let me know if there are more):
Before this discussion, recall that the BSON Object ID consists of:
[4 bytes seconds since epoch, 3 bytes machine hash, 2 bytes process ID, 3 bytes counter]
Here are the three possibilities, so you judge for yourself how likely it is to get a dupe:
1) Counter overflow: there are 3 bytes in the counter. If you happen to insert over 16,777,216 (2^24) documents in a single second, on the same machine, in the same process, then you may overflow the incrementing counter bytes and end up with two Object IDs that share the same time, machine, process, and counter values.
2) Counter non-incrementing: some Mongo drivers use random numbers instead of incrementing numbers for the counter bytes. In these cases, there is a 1/16,777,216 chance of generating a non-unique ID, but only if those two IDs are generated in the same second (i.e. before the time section of the ID updates to the next second), on the same machine, in the same process.
3) Machine and process hash to the same values. The machine ID and process ID values may, in some highly unlikely scenario, map to the same values for two different machines. If this occurs, and at the same time the two counters on the two different machines, during the same second, generate the same value, then you'll end up with a duplicate ID.
These are the three scenarios to watch out for. Scenario 1 and 3 seem highly unlikely, and scenario 2 is totally avoidable if you're using the right driver. You'll have to check the source of the driver to know for sure.
ObjectIds are generated client-side in a manner similar to UUID but with some nicer properties for storage in a database such as roughly increasing order and encoding their creation time for free. The key thing for your use case is that they are designed to guarantee uniqueness to a high probability even if they are generated on different machines.
Now if you were referring to the _id field in general, we do not require uniqueness across collections so it is safe to reuse the old _id. As a concrete example, if you have two collections, colors and fruits, both could simultaneously have an object like {_id: 'orange'}.
In case you want to know more about how ObjectIds are created, here is the spec: http://www.mongodb.org/display/DOCS/Object+IDs#ObjectIDs-BSONObjectIDSpecification
In case anyone is having problems with duplicate Mongo ObjectIDs, you should know that despite the unlikelihood of dups happening in Mongo itself, it is possible to have duplicate _id's generated with PHP in Mongo.
The use-case where this has happened with regularity for me is when I'm looping through a dataset and attempting to inject the data into a collection.
The array that holds the injection data must be explicitly reset on each iteration - even if you aren't specifying the _id value. For some reason, the INSERT process adds the Mongo _id to the array as if it were a global variable (even if the array doesn't have global scope). This can affect you even if you are calling the insertion in a separate function call where you would normally expect the values of the array not to persist back to the calling function.
There are three solutions to this:
You can unset() the _id field from the array
You can reinitialize the entire array with array() each time you loop through your dataset
You can explicitly define the _id value yourself (taking care to define it in such a way that you don't generate dups yourself).
My guess is that this is a bug in the PHP interface, and not so much an issue with Mongo, but if you run into this problem, just unset the _id and you should be fine.
There's no guarantee whatsoever about ObjectId uniqueness across collections. Even if it's probabilistically very unlikely, it would be a very poor application design that relied on _id uniqueness across collections.
One can easily test this in the mongo shell:
MongoDB shell version: 1.6.5
connecting to: test
> db.foo.insert({_id: 'abc'})
> db.bar.insert({_id: 'abc'})
> db.foo.find({_id: 'abc'})
{ "_id" : "abc" }
> db.bar.find({_id: 'abc'})
{ "_id" : "abc" }
> db.foo.insert({_id: 'abc', data:'xyz'})
E11000 duplicate key error index: test.foo.$_id_ dup key: { : "abc" }
So, absolutely don't rely on _id's being unique across collections, and since you don't control the ObjectId generation function, don't rely on it.
It's possible to create something that's more like a uuid, and if you do that manually, you could have some better guarantee of uniqueness.
Remember that you can put objects of different "types" in the same collection, so why not just put your two "tables" in the same collection. They would share the same _id space, and thus, would be guaranteed unique. Switching from "prospective" to "registered" would be a simple flipping of a field...