I would like to know if publishing _id of a document is safe.
I am using analytics software to track user behavior, and I need access to _id on the client for better context. However, it vexes me that I am publishing internal information about a document.
All in all, being new to mongo and Meteor, I would like to make sure that this is safe. Any suggestions?
If the document's _id is created by Meteor, it is fully random.
It doesn't contain any information besides a reference to the document itself.
If you're already publishing the document, exposing its _id should not reveal any further information.
Even when Meteor generates an ObjectID, the timestamp and other identifier components are random too, as mentioned in the Meteor docs (http://docs.meteor.com/#/full/mongo_object_id)
An _id that is an ObjectId generated by MongoDB externally, outside of Meteor, does contain information such as a timestamp and details about your server. But this should not be an issue if your app is a typical Meteor app.
If you're using the default ObjectId implementation, here's what you'd be exposing:
ObjectId is a 12-byte BSON type, constructed using:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
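To make the layout above concrete, here is a minimal sketch that splits a 24-character hex ObjectId into those four fields. It assumes exactly the layout listed above; parseObjectId is a hypothetical helper written for illustration, not a MongoDB driver function, and the sample ID is just an example value.

```javascript
// Split a 24-character hex ObjectId into the four fields listed above.
// Assumes the 4 + 3 + 2 + 3 byte layout described in this answer.
// parseObjectId is a hypothetical helper, not part of any driver.
function parseObjectId(hex) {
  if (!/^[0-9a-f]{24}$/i.test(hex)) {
    throw new Error("expected 24 hex characters");
  }
  return {
    timestamp: parseInt(hex.slice(0, 8), 16),   // 4 bytes: seconds since the Unix epoch
    machineId: hex.slice(8, 14),                // 3 bytes: machine identifier
    processId: parseInt(hex.slice(14, 18), 16), // 2 bytes: process id
    counter: parseInt(hex.slice(18, 24), 16),   // 3 bytes: counter
  };
}

const parts = parseObjectId("4b28dcb61083ed3c809e0416");
// The timestamp converts directly to the document's creation time.
console.log(new Date(parts.timestamp * 1000).toISOString(), parts);
```

Anyone who sees the _id can run the same few lines, which is the whole extent of the exposure.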
There are some slight concerns here:
Knowing the time, machine ID, process ID and a previous counter lets an attacker plausibly guess other ObjectIds which might be used to obtain other rows from your database (if you enable any form of access via the ID, which seems possible given that you want the client to know the _id).
Knowing the machine identifier might allow an attacker to identify particular database servers (only relevant if they already have access to your network)
Knowing the PID might allow an attacker to figure out the uptime of your DB server (watching for when the PID changes)
Knowing the time gives the attacker the creation time of the document (only valuable if the attacker didn't already know that)
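As a rough illustration of the first concern, the sketch below derives the "next" ID a single process would produce within the same second by bumping the counter. It assumes the layout listed above; nextObjectId is a hypothetical helper, and real IDs also depend on the timestamp moving forward.

```javascript
// Increment the 3-byte counter (the last 6 hex characters) of an ObjectId
// string, wrapping around at 0xffffff. Hypothetical helper for illustration.
function nextObjectId(hex) {
  const prefix = hex.slice(0, 18);              // timestamp + machine id + process id
  const counter = parseInt(hex.slice(18), 16);  // current counter value
  const next = (counter + 1) % 0x1000000;       // wrap after 2^24 values
  return prefix + next.toString(16).padStart(6, "0");
}

console.log(nextObjectId("4b28dcb61083ed3c809e0416"));
```

This is why any access by _id should be backed by an authorization check rather than by the hope that IDs are hard to guess.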
These are all fairly minor leaks in the grand scheme of things, although I usually lean towards avoiding unnecessary information leakage where possible. Consider using a non-default _id column (e.g. using a randomly-generated value), and whether you really need the plaintext _id column to be visible (maybe you can use a cryptographic hash of it instead, or perhaps you can encrypt it so that the client only sees an encrypted version).
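One way to sketch the "hash of it" idea in Node.js is below. Note that this swaps in a keyed hash (HMAC-SHA256) for the plain cryptographic hash mentioned above, on the assumption that a server-side secret is available; SECRET and publicToken are hypothetical names, and the secret shown is a placeholder.

```javascript
const crypto = require("crypto");

// Placeholder secret (an assumption): in practice this would come from
// server-side configuration and never be sent to the client.
const SECRET = "replace-with-a-server-side-secret";

// Derive a stable public token from an _id without revealing the _id itself.
// publicToken is a hypothetical helper name.
function publicToken(id) {
  return crypto.createHmac("sha256", SECRET).update(String(id)).digest("hex");
}

console.log(publicToken("4b28dcb61083ed3c809e0416"));
```

The token is one-way, so the server would need to store it alongside the document (or recompute it) to map a token back to its _id.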
It's quite safe. As other answers correctly stated, yes, it may contain some bits of information about your machine and server process (depending on where the id was generated, in Meteor or on the MongoDB server).
But leveraging them for any malicious action is a tough task. An attacker pretty much needs to get inside your network to do anything. And if he's in your network, you're screwed anyway, regardless of what ID format you expose.
By any measure, an ObjectId is more resilient to guessing than the regular auto-incrementing integer ids you find in other databases. And those are exposed all the time!
Note, though, that you shouldn't rely on the fact that an id is hard to guess. Protect private pages with authorization checks, etc., so that even if an attacker correctly guesses the id of a page he shouldn't have access to, he is refused access.
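A minimal sketch of such a check follows. The in-memory Map stands in for a collection, and the field name ownerId and the function getPage are hypothetical; the point is only that the lookup result is gated on who is asking, not on whether the id was known.

```javascript
// In-memory stand-in for a collection; the document shape is hypothetical.
const pages = new Map([
  ["4b28dcb61083ed3c809e0416", { ownerId: "alice", body: "private notes" }],
]);

// Return the page only if it exists and belongs to the requesting user.
function getPage(id, userId) {
  const page = pages.get(id);
  if (!page || page.ownerId !== userId) {
    return null; // same answer for "missing" and "forbidden"
  }
  return page;
}

console.log(getPage("4b28dcb61083ed3c809e0416", "alice"));   // the page
console.log(getPage("4b28dcb61083ed3c809e0416", "mallory")); // null
```

Returning the same result for a missing page and a forbidden one also avoids confirming that a guessed id exists.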
Exposing a document reference is not at all safe (_id is not only data; it is a reference, the primary key).
_id is 12 bytes which consist of
4 byte - timestamp
3 byte - machine id
2 byte - process id
3 byte - increment counter
There are possible ways to find out details about the documents and collections of the database using _id.
Note: Exposing _id is not at all safe.
Scenario
Suppose you are using _id for some socket listener or emitter. Anyone can guess other _ids from that particular exposed _id and emit malicious data.
An _id contains a timestamp, so when you expose it you are also exposing the creation time of that particular document.
Related
When inserting a document into a MongoDB and no _id is assigned explicitly, one is assigned automatically. My question is, where is this assignment made and where is the ID generated? Is it generated by the client, before sending the insert request, or is it on the server side?
The context of my question is that I want to use MongoDB to build an "event store" (in the sense of "event sourcing"). Part of that is that the store enforces an ordering on the events. There is already an internal ordering in MongoDB, which is sufficient. However, I may have to resume reading events at some point. For that, I can't use the last processed ID in the filter expression, because the process-unique and random parts of the ID don't guarantee any ordering. However, the timestamp part of the ID could do that, if only it were guaranteed to rise monotonically.
If the ID is generated by the server (and that server doesn't do anything funny like having a backward jumping system clock), then there is only one place that generates these IDs. Differences in system clocks and latency between different systems become irrelevant. Consequently, I can rely on the timestamp part of the ID increasing monotonically.
All clients (e.g., the Mongo Shell, or a Python, Java or NodeJS application) connect to the MongoDB database server via driver software. If the application does not supply a value for the _id field, the driver generates one. (NOTE: I believe, but am not sure, that if the driver fails to assign one, the database server assigns the value for the _id field.) The default _id field value is of BSON type ObjectId.
(1) According to MongoDB docs the ObjectId is generated by clients:
IMPORTANT
While ObjectId values should increase over time, they are not necessarily monotonic. This is because they:
Only contain one second of temporal resolution, so ObjectId values created within the same second do not have a guaranteed ordering, and
Are generated by clients, which may have differing system clocks.
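The second point in the quoted note can be sketched as follows. The example builds ObjectId-like hex strings by hand (makeId is a hypothetical helper using the timestamp, machine, process id, counter layout), with one client whose clock is assumed to run 5 seconds behind, and shows that sorting by _id no longer matches insertion order.

```javascript
// Build an ObjectId-like hex string from its parts (hypothetical helper).
function makeId(seconds, machineHex, pid, counter) {
  return (
    seconds.toString(16).padStart(8, "0") +
    machineHex +
    pid.toString(16).padStart(4, "0") +
    counter.toString(16).padStart(6, "0")
  );
}

// Client A's clock is correct; client B's clock runs 5 seconds behind.
const t = 1700000000;
const inserts = [
  { order: 1, _id: makeId(t, "aaaaaa", 100, 1) },         // A inserts first
  { order: 2, _id: makeId(t + 1 - 5, "bbbbbb", 200, 1) }, // B inserts one second later
];

// Fixed-width lowercase hex strings compare the same way as the raw bytes.
const sorted = [...inserts].sort((x, y) =>
  x._id < y._id ? -1 : x._id > y._id ? 1 : 0
);

// B's document (order 2) sorts before A's (order 1).
console.log(sorted.map((d) => d.order));
```

With skewed clocks the timestamp prefix decides the order, so the later insert can appear earlier.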
(2) So, how do you ensure your ObjectId is unique?
See these topics in the MongoDB blog article "Generating Globally Unique Identifiers for Use with MongoDB":
Ensure identifier uniqueness at the database level
Use ObjectID as a unique identifier
In my app I'm letting mongo generate order id's via its ObjectId method.
But in user testing we've had some concerns that the order id's are humanly 'intimidating', i.e. if you need to discuss your order with someone over the telephone, reading out 24 alphanumeric characters is a bit tedious.
At the same time, I don't really want to have to store two different id's, one 'human-accessible' and one used by mongo internally.
So my question is this - is there a way to choose a substring of length 6 or even 8 of the mongo objectId string that I could be fairly sure would be unique ?
For example if I have a mongo objectid like this
id = '4b28dcb61083ed3c809e0416'
maybe I could take out
human_id = id.substr(0,7);
and be sure that i'd always get unique id's for my orders...
The advantage of course is that these are orders, and so are human-created, and so there aren't millions of them per millisecond. On the other hand, it would really be a problem if two orders had the same shortened id...
--- clearer explanation ---
I guess a better way to ask my question would be this :
If I decide for example to just use the last 6 characters of a mongo id, is there some kind of measure of 'probability' that just these 6 characters would repeat in a given week ?
Given a certain number of mongo's running in parallel, a certain number of users during the week, etc.
If you have multiple web servers, with multiple processes, then there really isn't anything you can remove without losing uniqueness.
If you look at the nature of the ObjectId:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
You'll see there's not much there that you could safely remove. As the first 4 bytes are time, it would be challenging to implement an algorithm that removed portions of the time stamp in a clean and safe way.
The machine identifier and process identifier are used in cases where there are multiple servers and/or processes acting as clients to the database server. If you dropped either of those, you could end up with duplicates again. The random value as the last 3 bytes is used to make sure that two identifiers, on the same machine, within the same process are unique, even when requested frequently.
If you were using it as an order id, and you want assured uniqueness, I wouldn't trim anything away from the 12 byte number as it was carefully designed to provide a robust and efficient distributed mechanism for generating unique numbers when there are many connected database clients.
If you took the last 5 characters of the ObjectId ..., and in a given period, what's the probability of conflict?
process id
counter
The probability of conflict is high. The process id may remain the same through the entire period, and the other number is just an incrementing counter that wraps around: the full 3-byte counter repeats after 16,777,216 values, and a 5-character hex suffix can take only 1,048,576 distinct values. But, if the process recycles, then you also have the chance that there will be a conflict with older orders, etc. And if you're talking multiple database clients, the chances increase as well. I just wouldn't try to trim the number. It's not worth the unhappy customers trying to place orders.
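For a rough feel of the numbers, here is a birthday-style estimate. It rests on a loud assumption: that the trimmed suffixes behave like uniformly random values, which real ObjectId counters do not (they increment from a random start per process), so treat it as a ballpark model only. collisionProbability is a hypothetical helper.

```javascript
// Approximate probability that at least two of n uniformly random suffixes
// collide, where each suffix is `hexChars` hex characters long.
// Uses the standard birthday approximation 1 - exp(-n(n-1) / (2N)).
// ASSUMPTION: suffixes are uniformly random, which is only a rough model.
function collisionProbability(n, hexChars) {
  const space = Math.pow(16, hexChars); // number of distinct suffixes N
  return 1 - Math.exp(-(n * (n - 1)) / (2 * space));
}

console.log(collisionProbability(1000, 6));  // about 3% for 1,000 orders
console.log(collisionProbability(10000, 6)); // about 95% for 10,000 orders
```

Even under this optimistic model, a 6-character suffix becomes unreliable after a few thousand orders.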
Even the timestamp and the random seed value aren't sufficient when there are multiple database clients generating ObjectIds. As you start to look at the various pieces, especially in the context of a farm of database clients, you should see why the pieces are there, and why removing them could lead to a meltdown in ObjectId generation.
I'd suggest you implement an algorithm to create a unique number and store it in the database. It's simple enough to do. It does impact performance a bit, but it's safe.
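The shape of such an algorithm might look like the sketch below. It uses an in-memory Map as a stand-in for a counters collection, and the names nextSequence, nextOrderId and the "ORD-" prefix are all hypothetical. Against MongoDB the increment would have to be one atomic update (for example an $inc applied with findOneAndUpdate) rather than a separate read and write.

```javascript
// In-memory stand-in for a "counters" collection (hypothetical name).
const counters = new Map();

// Return the next value of a named sequence. In a real database this step
// must be a single atomic operation so two clients never get the same value.
function nextSequence(name) {
  const next = (counters.get(name) || 0) + 1;
  counters.set(name, next);
  return next;
}

// Format the sequence value as a short order number that is easy to read
// out over the phone.
function nextOrderId() {
  return "ORD-" + String(nextSequence("orders")).padStart(6, "0");
}
```

The short number can be stored next to the ObjectId or used as the _id itself, depending on which lookup you need to be fastest.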
I wrote this answer a while ago about the challenges of using an ObjectId in a Url. It includes a link to how to create a unique auto incrementing number using MongoDB.
Actually, what you choose for an Id (actually _id in MongoDB storage) is totally up to you. If there is some useful data you can keep in _id while still keeping it unique, then do so. If it has to be something valid for URL encoding, then do so.
By default, if you do not specify an _id then that field will be populated with the value you have come to love and hate. But if you explicitly use it, then you will get what you want.
The extra thing to keep in mind is that even if you specify an additional unique index field, let's say order_id, MongoDB would actually have to check through that and other indexes in a query plan to see which one was best to use. But if _id was your key, the plan would give up and go straight for the 'Primary Key', and this is going to be a lot faster.
So make your own Id just as long as you can ensure it will be unique.
I'm using mongoDB for the first time in a RESTful service. Previously the id column in my SQL databases was an incrementing integer so my RESTful endpoints would look something like /rest/objectType/1. Is there any reason why I shouldn't just use mongoDB's ObjectId's in the same role, or is it wiser to maintain a separate incrementing integer id column and use this for urls?
Having used ObjectIds in RESTful APIs several times, the biggest downside is really that they are very noisy in terms of having a clean URL. You'll either leave it as a HEX number, or convert it to a very large integer number, both making for a somewhat unfriendly URL:
/rest/resource/52435dbecb970072ec3a780f
/rest/resource/25459211534898951476729247759
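For reference, converting between the two forms is a couple of lines with BigInt. The helper names hexToDecimal and decimalToHex are hypothetical.

```javascript
// Convert a 24-character hex ObjectId string to its decimal form.
function hexToDecimal(hex) {
  return BigInt("0x" + hex).toString(10);
}

// Convert the decimal form back to a 24-character hex string.
function decimalToHex(dec) {
  return BigInt(dec).toString(16).padStart(24, "0");
}

const dec = hexToDecimal("52435dbecb970072ec3a780f");
console.log(dec, decimalToHex(dec));
```

Plain Number arithmetic is not enough here, because a 12-byte value exceeds the range of exactly representable integers.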
I've added a "title" to the URL (like StackOverflow does) to make them slightly more friendly:
/rest/resource/52435dbecb970072ec3a780f/FriendlyResourceName
Of course, the "title" is ignored in software, but the user sees it and can mentally ignore the crazy ID segment.
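A minimal sketch of a route parser that behaves this way is below. The /rest/resource/ prefix follows the example URLs above, and resourceIdFromPath is a hypothetical helper rather than part of any framework.

```javascript
// Extract the ObjectId from paths like /rest/resource/<id>/<optional-title>.
// The title segment, if present, is matched but ignored.
function resourceIdFromPath(path) {
  const match = /^\/rest\/resource\/([0-9a-f]{24})(?:\/[^/]*)?\/?$/i.exec(path);
  return match ? match[1] : null;
}

console.log(resourceIdFromPath("/rest/resource/52435dbecb970072ec3a780f/FriendlyResourceName"));
console.log(resourceIdFromPath("/rest/resource/52435dbecb970072ec3a780f"));
```

Because the title never takes part in the lookup, renaming a resource does not break old links.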
There's very little useful that could be learned from the infrastructure by exposing them:
Timestamp
Machine ID
Process ID
Random incrementing value
Other than potentially gathering Machine IDs (which generally would indicate the number of clients creating ObjectIds), there's not much there.
ObjectIds aren't random, so you couldn't use them for security. You'll always need to secure the data. While they may not increment in an obvious way, it would be easy to find other resources through brute force. However, if you were using auto-incrementing IDs before, this isn't a new problem for you.
If you know you aren't creating many new documents at any given time, it might be worth using one of the patterns here to create a simpler ID. In one app I wrote, I used an auto-inc technique for some of the document IDs that were shown in URLs, and for those that were Ajax-only, I used ObjectIds. I really wanted some URLs to be easily "typed". No form of an ObjectId is easily typed by an end user. That's one of the strengths of MongoDB -- that you can use any _id format you want. :)
It's wiser to use the ObjectIds, because keeping an incrementing counter can be a bottleneck. Also, since an ObjectId starts with a timestamp and so roughly increases over time (though it is not strictly monotonic), it can be helpful in optimizing queries.
The ObjectIds can be guessed, but since that is definitely true for incrementing IDs, I suspect you didn't rely on security through obscurity before, so that's no trouble for you.
A downside, albeit a small one, is that the creation time on your server leaks to the user, i.e. if the user is able to identify this as an ObjectId, she can reverse-engineer the creation time of the object. That's the only potential issue I see.
The autogenerated BSON ID that is stored in the _id field of every document, is it a GUID?
The documentation says it's 'most likely unique', so I am a little confused. Why would they use an id that is not guaranteed to be unique?
Its uniqueness is based upon probability. Contrary to #mattexx's answer:
It's not "guaranteed" to be unique because MongoDB does not enforce uniqueness to save time.
MongoDB DOES enforce uniqueness on the ObjectId; it in fact has a unique index on the _id field. As for saving time, the ObjectId is historical in that respect, since it was designed in the days when MongoDB did not ack any writes and needed a 99% chance of being able to insert a new unique record without the client waiting for an ack (ObjectIds are generated client side).
They are not GUIDs; however, as #Asya says, they are guaranteed to have a high level of uniqueness.
So long as time never moves backwards there is still a 99% chance it will be unique forever. Okay, as #Devesh says, there is a 1 in 1 trillion (? haven't done the math) chance of even a GUID being duplicated but, again, I do not think you will reach that probability anytime soon.
It is unique for most requirements, and it consists of a timestamp, a unique identifier of the machine (a hash of the machine host name), a process identifier and, last, an incrementing number. http://docs.mongodb.org/manual/reference/object-id/
The chance of an ID collision is theoretically close enough to zero that uniqueness can be presumed for typical web apps. Many real-world systems (Mongo or not) do rely on this property of GUIDs, though it would not be a good assumption for safety/mission-critical systems.
In practical terms, there are indeed scenarios where it could go wrong if there's a misconfiguration or a third-party library bug. Those shouldn't rule out the concept, but it's important to be aware of those risks and avoid them where possible.
Some good analysis here of practical issues that could cause a collision. In particular:
Some Mongo drivers use random numbers instead of incrementing numbers for the counter bytes. In these cases, there is a 1/16,777,216 chance of generating a non-unique ID, but only if those two IDs are generated in the same second (i.e. before the time section of the ID updates to the next second), on the same machine, in the same process.
ObjectId is explained in the doc here. It's not "guaranteed" to be unique because MongoDB does not enforce uniqueness to save time. It simply trusts that the complicated generation algorithm will probably never produce two identical ObjectIds in the same datastore. So technically it is not a GUID, but pretty much as good.
When storing events in an event store, the order in which the events are stored is very important, especially when projecting the events later to restore an entity's current state.
MongoDB seems to be a good choice for persisting the event store, given its speed and flexible schema (and it is often recommended as such), but there is no such thing as a transaction in MongoDB, meaning the correct event order cannot be guaranteed.
Given that fact, should you not use MongoDB if you are looking for a consistent event store but rather stick with a conventional RDBMS, or is there a way around this problem?
I'm not familiar with the term "event store" as you are using it, but I can address some of the issues in your question. I believe it is probably reasonable to use MongoDB for what you want, with a little bit of care.
In MongoDB, each document has an _id field which is by default in ObjectId format, which consists of a timestamp, then a machine identifier and process id, and then a sequence counter. So you can sort on that field and you'll get your objects in their creation order, provided the ObjectIds are all created on the same machine.
Most MongoDB client drivers create the _id field locally before sending an insert command to the database. So if you have multiple clients connecting to the database, sorting by _id won't do what you want: the clients' clocks may differ, and IDs created within the same second sort by machine identifier rather than by creation order.
But if you can convince your MongoDB client driver to not include the _id in the insert command, then the server will generate the ObjectId for each document and they will have the properties you want. Doing this will depend on what language you're working in since each language has its own client driver. Read the driver docs carefully or dive into their source code -- they're all open source. Most drivers also include a way to send a raw command to the server. So if you construct an insert command by hand this will certainly allow you to do what you want.
This will break down if your system is so massive that a single database server can't handle all of your write traffic. The MongoDB solution to needing to write thousands of records per second is to set up a sharded database. In this case the ObjectIds will again be created by different machines and won't have the nice sorting property you want. If you're concerned about outgrowing a single server for writes, you should look to another technology that provides distributed sequence numbers.