am I exposing sensitive data if I put a bson ID in a url? - mongodb

Say I have a Products array in my Mongodb. I'd like users to be able to see each product on their own page: http://www.mysite.com/product/12345/Widget-Wodget. Since each Product doesn't have an incremental integer ID (12345) but instead it has a BSON ID (5063a36bdeb13f7505000630), I'd need to either add the integer ID or use the BSON ID.
Since BSON ID's include the PID:
4-byte timestamp,
3-byte machine identifier,
2-byte process id,
3-byte counter.
Am I exposing secure information to the outside world if I use the BSON ID in my url?

I can't think of any use to gain privileges on your machines, however using ObjectIds everywhere discloses a lot of information nonetheless.
By crawling your website, one could:
find about some hidden objects: for instance, if the counter part goes from 0x....b1 to 0x....b9 between times t1 and t2, one can guess ObjectIds within these invervals. However, guessing ids is most likely useless if you enforce access permissions
know the signup date of each user (not very sensitive info but better than nothing)
deduce actual (as opposed to publicly available) business hours from the timestamps of objects created by the staff
deduce in which timezones your audience lives from the timestamps of user-generated objects: if your website is one which people use mostly at lunchtime, then one could measure peaks of ObjectIds and deduce that a peak at 8 PM UTC means the audience was on the US West coast
and more generally, by crawling most of your website, one can build a timeline of the success of your service, having for any given time knowledge of: your user count, levels of user engagement, how many servers you've got, how often your servers are restarted. PID changes occurring on weekends are more likely crashes, whereas those on business days are more likely crashes + software revisions
and probably find other info specific to your business processes and domain
To be fair, even with random ids one can infer a lot. The main issue is that you need to prevent anyone from scraping a statistically significant part of your site. But if someone is determined, they'll succeed eventually, which is why providing them with all of this extra, timestamped info seems wrong.

Sharing the information in the ObjectID will not compromise your security. Someone could infer minor details such as when the ObjectID was created (timestamp), but none of the ObjectID components should be tied to authentication or authorization.
If you are building an e-commerce site, SEO is typically a strong consideration for public URLs. In this case you normally want to use a friendlier URL with shorter and more semantic path components than an ObjectID.
Note that you do not have to use the default ObjectID for your _id field .. so could always generate something more relevant for your application. The default ObjectID does provide a reasonable guarantee of uniqueness, so if you implement your own _id allocation you will have to take this into consideration.
See also:
Create an Auto-Incrementing Sequence Field

As #Stennie said, not really.
Let's start with the pid, most hackers wouldn't bother looking for a pid, on say Linux, instead they would just do:
ps aux | grep mongod
or something similar. Of course this requires the hacker to have actually hacked your server, I know of no public hack available based on the pid alone. Considering the pid will change when you restart the machine or mongod, this information is utterly useless to anyone trying to spy.
The machine id is another bit of data that is quite useless publicly and, to be honest, they would get a better understanding of your network using ping or digg than they would through the machine id alone.
So to answer the question: No, there is no real security threat and the information you are displaying is of no use to anyone except MongoDB really.
I also agree with #Stennie on using SEO friendly URLs, an example which I commonly use for e-commerce is /product/product_title_ with a smaller random id (maybe base 64 encode the _id) or a auto incrementing id with .html on the end.

Related

To relate one record to another in MongoDB, is it ok to use a slug?

Let's say we have two models like this:
User:
_ _id
- name
- email
Company:
- _id
_ name
_ slug
Now let's say I need to connect a user to the company. A user can have one company assigned. To do this, I can add a new field called companyID in the user model. But I'm not sending the _id field to the front end. All the requests that come to the API will have the slug only. There are two ways I can do this:
1) Add slug to relate the company: If I do this, I can take the slug sent from a request and directly query for the company.
2) Add the _id of the company: If I do this, I need to first use the slug to query for the company and then use the _id returned to query for the required data.
May I please know which way is the best? Is there any extra benefit when using the _id of a record for the relationship?
Agree with the 2nd approach. There are several issues to consider when deciding on which field to use as a join key (this is true of all DBs, not just Mongo):
The field must be unique. I'm not sure exactly what the 'slug' field in your schema represents, but if there is any chance this could be duplicated, then don't use it.
The field must not change. Strictly speaking, you can change a key field but the only way to safely do so is to simultaneously change it in all the child tables atomically. This is a difficult thing to do reliably because a) you have to know which tables are using the field (maybe some other developer added another table that you're not aware of) b) If you do it one at a time, you'll introduce race conditions c) If any of the updates fail, you'll have inconsistent data and corrupted parent-child links. Some SQL DBs have a cascading-update feature to solve this problem, but Mongo does not. It's a hard enough problem that you really, really don't want to change a key field if you don't have to.
The field must be indexed. Strictly speaking this isn't true, but if you're going to join on it, then you will be running a lot of queries on it, so you'll need to index it.
For these reasons, it's almost always recommended to use a key field that serves solely as a key field, with no actual information stored in it. Plenty of people have been burned using things like Social Security Numbers, drivers licenses, etc. as key fields, either because there can be duplicates (e.g. SSNs can be duplicated if people are using fake numbers, or if they don't have one), or the numbers can change (e.g. drivers licenses).
Plus, by doing so, you can format the key field to optimize for speed of unique generation and indexing. For example, if you use SSNs, you need to check the SSN against the rest of the DB to ensure it's unique. That takes time if you have millions of records. Similarly for slugs, which are text fields that need to be hashed and checked against an index. OTOH, mongoDB essentially uses UUIDs as keys, which means it doesn't have to check for uniqueness (the algorithm guarantees a high statistical likelihood of uniqueness).
The bottomline is that there are very good reasons not to use a "real" field as your key if you can help it. Fortunately for you, mongoDB already gives you a great key field which satisfies all the above criteria, the _id field. Therefore, you should use it. Even if slug is not a "real" field and you generate it the exact same way as an _id field, why bother? Why does a record have to have 2 unique identifiers?
The second issue in your situation is that you don't expose the company's _id field to the user. Intuitively, it seems like that should be a valuable piece of information that shouldn't be given out willy-nilly. But the truth is, it has no informational value by itself, because, as stated above, a key should have no actual information. The place to implement security is in the query, ensuring that the user doing the query has permission to access the record / specific fields that she's asking for. Hiding the key is a classic security-by-obscurity that doesn't actually improve security.
The only time to hide your primary key is if you're using a poorly thought-out key that does contain useful information. For example, an invoice Id that increments by 1 for each invoice can be used by someone to figure out how many orders you get in a day. Auto-increment Ids can also be easily guessed (if my invoice is #5, can I snoop on invoice #6?). Fortunately, Mongo uses UUIDs so there's really no information leaking out (except maybe for timing attacks on its cryptographic algorithm? And if you're worried about that, you need far more in-depth security considerations than this post :-).
Look at it another way: if a slug reliably points to a specific company and user, then how is it more secure than just using the _id?
That said, there are some instances where exposing a secondary key (like slugs) is helpful, none of which have to do with security. For example, if in the future you need to migrate DB platforms and need to re-generate keys because the new platform can't use your old ones; or if users will be manually typing in identifiers, then it's helpful to give them something easier to remember like slugs. But even in those situations, you can use the slug as a handy identifier for users to use, but in your DB, you should still use the company ID to do the actual join (like in your option #2). Check out this discussion about the pros/cons of exposing _ids to users:
https://softwareengineering.stackexchange.com/questions/218306/why-not-expose-a-primary-key
So my recommendation would be to go ahead and give the user the company Id (along with the slug if you want a human-readable format e.g. for URLs, although mongo _ids can be used in a URL). They can send it back to you to get the user, and you can (after appropriate permission checks) do the join and send back the user data. If you don't want to expose the company Id, then I'd recommend your option #2, which is essentially the same thing except you're adding an additional query to first get the company Id. IMHO, that's a waste of cycles for no real improvement in security, but if there are other considerations, then it's still acceptable. And both of those options are better than using the slug as a primary key.
Second way of approach is the best,That is Add the _id of the company.
Using _id is the best way of practise to query any kind of information,even complex queries can be solved using _id as it is a unique ObjectId created by Mongodb. Population is the process of automatically replacing the specified paths in the document with document(s) from other collection(s). We may populate a single document, multiple documents, plain object, multiple plain objects, or all objects returned from a query.

Should I use ObjectID or uid(implemented by myself) to identify user?

I am new to mongodb and database.
Implement a function to make uid and use the local ObjectId.
Which is better?
You should leave ObjectID generation to the clients/drivers. This makes sure that generated IDs are unique among many things, such as time, server and process. Using the standard ObjectID also means that methods implemented by drivers (such as getTimestamp()) work.
However, if you are thinking of using your own type of ID for the _id field (ie, not the standard ObjectID type), then that makes a viable choice. For example, if you want to store information about a twitter user, then using the user's twitter ID as _id value makes perfect sense. Personally, I try to rely on the ObjectID type as little as I have to, as often collections will have a field in each document already that uniquely identifies each document.
This depends on three things:
Its source
Where and how are you using the user ID?
Personal opinion.
My personal opinion is that the object ID is good enough, however, getting back to the first and second point.
If this ID comes or is to be used in another database like an SQL database you might find using an incrementing ID a good idea, but SQL and other techs do fully support the object ID in the hexadecimal form.
If this ID is something that can be used much like an account number (think of your account number for car insurance when you phone them up) you might find an object ID too difficult for your users to remember/recounter as such a more human friendly ID might be more applicable here.
So it really depends on how this ID is being used.

use the database id as the restful service id expose a threat?

I have a restful service for the documents, where the documents are stored in mongodb, the restful api for the document is /document/:id, initially the :id in the api is using the mongodb 's object id, but I wonder deos this approach reveal the database id, and expose the potential threat, should I want to replace it with a pseudonymity id.
if it is needed to replace it the pseudonymity id, I wonder if there is a algorithmic methods for me to transform the object id and pseudonymity id back and forth without much computation
First, there is no "database id" contained in the ObjectID.
I'm assuming your concern comes from the fact that the spec lists a 3 byte machine identifier as part of the ObjectID. A couple of things to note on that:
Most of the time, the ObjectID is actually generated on the client side, not the server (though it can be). Hence this is usually the machine identifier for the application server, not your database
The 3 byte Machine ID is the first three bytes of the (md5) hash of the machine host name, or of the mac/network address, or the virtual machine id (depending on the particular implementation), so it can't be reversed back into anything particularly meaningful
With the above in mind, you can see that worrying about exposing information is not really a concern.
However, with even a small sample, it is relatively easy to guess valid ObjectIDs, so if you want to avoid that type of traffic hitting your application, then you may want to use something else (a hash of the ObjectID might be a good idea for example), but that will be dependent on your requirements.

Is it ok to turn the mongo ObjectId into a string and use it for URLs?

document/show?id=4cf8ce8a8aad6957ff00005b
Generally I think you should be cautious to expose internals (such as DB ids) to the client. The URL can easily be manipulated and the user has possibly access to objects you don't want him to have.
For MongoDB in special, the object ID might even reveal some additional internals (see here), i.e. they aren't completely random. That might be an issue too.
Besides that, I think there's no reason not to use the id.
I generally agree with #MartinStettner's reply. I wanted to add a few points, mostly elaborating what he said. Yes, a small amount of information is decodeable from the ObjectId. This is trivially accessible if someone recognizes this as a MongoDB ObjectID. The two downsides are:
It might allow someone to guess a different valid ObjectId, and request that object.
It might reveal info about the record (such as its creation date) or the server that you didn't want someone to have.
The "right" fix for the first item is to implement some sort of real access control: 1) a user has to login with a username and password, 2) the object is associated with that username, 3) the app only serves objects to a user that are associated with that username.
MongoDB doesn't do that itself; you'll have to rely on other means. Perhaps your web-app framework, and/or some ad-hoc access control list (which itself could be in MongoDB).
But here is a "quick fix" that mostly solves both problems: create some other "id" for the record, based on a large, high-quality random number.
How large does "large" need to be? A 128-bit random number has 3.4 * 10^38 possible values. So if you have 10,000,000 objects in your database, someone guessing a valid value is a vanishingly small probability: 1 in 3.4 * 10^31. Not good enough? Use a 256-bit random number... or higher!
How to represent this number in the document? You could use a string (encoding the number as hex or base64), or MongoDB's binary type. (Consult your driver's API docs to figure out how to created a binary object as part of a document.)
While you could add a new field to your document to hold this, then you'd probably also want an index. So the document size is bigger, and you spend more memory on that index. Here's what you might not have though of: simply USE that "truly random id" as your documents "_id" field. Thus the per-document size is only a little higher, and you use the index that you [probably] had there anyways.
I can set both the 128 character session string and other collection document object ids as cookies and when user visits do a asynchronous fetch where I fetch the session, user and account all at once. Instead of fetching the session first and then after fetching user, account. If the session document is valid ill share the user and account documents.
If I do this I'll have to make every single request for a user and account document require the session 128 character session cookie to be fetched too thus making exposing the user and account object id safer. It means if anyone is guessing a user ID or account ID, they also have to guess the 128 string to get any answers from the system.
Another security measure you could do is wrap the id is some salt which you only know the positioning such as
XXX4cf8ce8XXXXa8aad6957fXXXXXXXf00005bXXXX
Now you know exactly how to slice that up to get the ID.

How to separate a person's identity from his personal data?

I'm writing an app which main purpose is to keep list of users
purchases.
I would like to ensure that even I as a developer (or anyone with full
access to the database) could not figure out how much money a
particular person has spent or what he has bought.
I initially came up with the following scheme:
--------------+------------+-----------
user_hash | item | price
--------------+------------+-----------
a45cd654fe810 | Strip club | 400.00
a45cd654fe810 | Ferrari | 1510800.00
54da2241211c2 | Beer | 5.00
54da2241211c2 | iPhone | 399.00
User logs in with username and password.
From the password calculate user_hash (possibly with salting etc.).
Use the hash to access users data with normal SQL-queries.
Given enough users, it should be almost impossible to tell how much
money a particular user has spent by just knowing his name.
Is this a sensible thing to do, or am I completely foolish?
I'm afraid that if your application can link a person to its data, any developer/admin can.
The only thing you can do is making it harder to do the link, to slow the developer/admin, but if you make it harder to link users to data, you will make it harder for your server too.
Idea based on #no idea :
You can have a classic user/password login to your application (hashed password, or whatever), and a special "pass" used to keep your data secure. This "pass" wouldn't be stored in your database.
When your client log in your application I would have to provide user/password/pass. The user/password is checked with the database, and the pass would be used to load/write data.
When you need to write data, you make a hash of your "username/pass" couple, and store it as a key linking your client to your data.
When you need to load data, you make a hash of your "username/pass" couple, and load every data matching this hash.
This way it's impossible to make a link between your data and your user.
In another hand, (as I said in a comment to #no) beware of collisions. Plus if your user write a bad "pass" you can't check it.
Update : For the last part, I had another idea, you can store in your database a hash of your "pass/password" couple, this way you can check if your "pass" is okay.
Create a users table with:
user_id: an identity column (auto-generated id)
username
password: make sure it's hashed!
Create a product table like in your example:
user_hash
item
price
The user_hash will be based off of user_id which never changes. Username and password are free to change as needed. When the user logs in, you compare username/password to get the user_id. You can send the user_hash back to the client for the duration of the session, or an encrypted/indirect version of the hash (could be a session ID, where the server stores the user_hash in the session).
Now you need a way to hash the user_id into user_hash and keep it protected.
If you do it client-side as #no suggested, the client needs to have user_id. Big security hole (especially if it's a web app), hash can be easily be tampered with and algorithm is freely available to the public.
You could have it as a function in the database. Bad idea, since the database has all the pieces to link the records.
For web sites or client/server apps you could have it on your server-side code. Much better, but then one developer has access to the hashing algorithm and data.
Have another developer write the hashing algorithm (which you don't have access to) and stick in on another server (which you also don't have access to) as a TCP/web service. Your server-side code would then pass the user ID and get a hash back. You wouldn't have the algorithm, but you can send all the user IDs through to get all their hashes back. Not a lot of benefits to #3, though the service could have logging and such to try to minimize the risk.
If it's simply a client-database app, you only have choices #1 and 2. I would strongly suggest adding another [business] layer that is server-side, separate from the database server.
Edit:
This overlaps some of the previous points. Have 3 servers:
Authentication server: Employee A has access. Maintains user table. Has web service (with encrypted communications) that takes user/password combination. Hashes password, looks up user_id in table, generates user_hash. This way you can't simply send all user_ids and get back the hashes. You have to have the password which isn't stored anywhere and is only available during authentication process.
Main database server: Employee B has access. Only stores user_hash. No userid, no passwords. You can link the data using the user_hash, but the actual user info is somewhere else.
Website server: Employee B has access. Gets login info, passes to authentication server, gets hash back, then disposes login info. Keeps hash in session for writing/querying to the database.
So Employee A has user_id, username, password and algorithm. Employee B has user_hash and data. Unless employee B modifies the website to store the raw user/password, he has no way of linking to the real users.
Using SQL profiling, Employee A would get user_id, username and password hash (since user_hash is generated later in code). Employee B would get user_hash and data.
Keep in mind that even without actually storing the person's identifying information anywhere, merely associating enough information all with the same key could allow you to figure out the identity of the person associated with certain information. For a simple example, you could call up the strip club and ask which customer drove a Ferrari.
For this reason, when you de-identify medical records (for use in research and such), you have to remove birthdays for people over 89 years old (because people that old are rare enough that a specific birthdate could point to a single person) and remove any geographic coding that specifies an area containing fewer than 20,000 people. (See http://privacy.med.miami.edu/glossary/xd_deidentified_health_info.htm)
AOL found out the hard way when they released search data that people can be identified just by knowing what searches are associated with an anonymous person. (See http://www.fi.muni.cz/kd/events/cikhaj-2007-jan/slides/kumpost.pdf)
The only way to ensure that the data can't be connected to the person it belongs to is to not record the identity information in the first place (make everything anonymous). Doing this, however, would most likely make your app pointless. You can make this more difficult to do, but you can't make it impossible.
Storing user data and identifying information in separate databases (and possibly on separate servers) and linking the two with an ID number is probably the closest thing that you can do. This way, you have isolated the two data sets as much as possible. You still must retain that ID number as a link between them; otherwise, you would be unable to retrieve a user's data.
In addition, I wouldn't recommend using a hashed password as a unique identifier. When a user changes their password, you would then have to go through and update all of your databases to replace the old hashed password IDs with the new ones. It is usually much easier to use a unique ID that is not based on any of the user's information (to help ensure that it will stay static).
This ends up being a social problem, not a technological problem. The best solutions will be a social solution. After hardening your systems to guard against unauthorized access (hackers, etc), you will probably get better mileage working on establishing trust with your users and implementing a system of policies and procedures regarding data security. Include specific penalties for employees who misuse customer information. Since a single breach of customer trust is enough to ruin your reputation and drive all of your users away, the temptation of misusing this data by those with "top-level" access is less than you might think (since the collapse of the company usually outweighs any gain).
The problem is that if someone already has full access to the database then it's just a matter of time before they link up the records to particular people. Somewhere in your database (or in the application itself) you will have to make the relation between the user and the items. If someone has full access, then they will have access to that mechanism.
There is absolutely no way of preventing this.
The reality is that by having full access we are in a position of trust. This means that the company managers have to trust that even though you can see the data, you will not act in any way on it. This is where little things like ethics come into play.
Now, that said, a lot of companies separate the development and production staff. The purpose is to remove Development from having direct contact with live (ie:real) data. This has a number of advantages with security and data reliability being at the top of the heap.
The only real drawback is that some developers believe they can't troubleshoot a problem without production access. However, this is simply not true.
Production staff then would be the only ones with access to the live servers. They will typically be vetted to a larger degree (criminal history and other background checks) that is commiserate with the type of data you have to protect.
The point of all this is that this is a personnel problem; and not one that can truly be solved with technical means.
UPDATE
Others here seem to be missing a very important and vital piece of the puzzle. Namely, that the data is being entered into the system for a reason. That reason is almost universally so that it can be shared. In the case of an expense report, that data is entered so that accounting can know who to pay back.
Which means that the system, at some level, will have to match users and items without the data entry person (ie: a salesperson) being logged in.
And because that data has to be tied together without all parties involved standing there to type in a security code to "release" the data, then a DBA will absolutely be able to review the query logs to figure out who is who. And very easily I might add regardless of how many hash marks you want to throw into it. Triple DES won't save you either.
At the end of the day all you've done is make development harder with absolutely zero security benefit. I can't emphasize this enough: the only way to hide data from a dba would be for either 1. that data to only be accessible by the very person who entered it or 2. for it to not exist in the first place.
Regarding option 1, if the only person who can ever access it is the person who entered it.. well, there is no point for it to be in a corporate database.
It seems like you're right on track with this, but you're just over thinking it (or I simply don't understand it)
Write a function that builds a new string based on the input (which will be their username or something else that cant change overtime)
Use the returned string as a salt when building the user hash (again I would use the userID or username as an input for the hash builder because they wont change like the users' password or email)
Associate all user actions with the user hash.
No one with only database access can determine what the hell the user hashes mean. Even an attempt at brute forcing it by trying different seed, salt combinations will end up useless because the salt is determined as a variant of the username.
I think you've answered you own question with your initial post.
Actually, there's a way you could possibly do what you're talking about...
You could have the user type his name and password into a form that runs a purely client-side script which generates a hash based on the name and pw. That hash is used as a unique id for the user, and is sent to the server. This way the server only knows the user by hash, not by name.
For this to work, though, the hash would have to be different from the normal password hash, and the user would be required to enter their name / password an additional time before the server would have any 'memory' of what that person bought.
The server could remember what the person bought for the duration of their session and then 'forget', because the database would contain no link between the user accounts and the sensitive info.
edit
In response to those who say hashing on the client is a security risk: It's not if you do it right. It should be assumed that a hash algorithm is known or knowable. To say otherwise amounts to "security through obscurity." Hashing doesn't involve any private keys, and dynamic hashes could be used to prevent tampering.
For example, you take a hash generator like this:
http://baagoe.com/en/RandomMusings/javascript/Mash.js
// From http://baagoe.com/en/RandomMusings/javascript/
// Johannes Baagoe <baagoe#baagoe.com>, 2010
function Mash() {
var n = 0xefc8249d;
var mash = function(data) {
data = data.toString();
for (var i = 0; i < data.length; i++) {
n += data.charCodeAt(i);
var h = 0.02519603282416938 * n;
n = h >>> 0;
h -= n;
h *= n;
n = h >>> 0;
h -= n;
n += h * 0x100000000; // 2^32
}
return (n >>> 0) * 2.3283064365386963e-10; // 2^-32
};
mash.version = 'Mash 0.9';
return mash;
}
See how n changes, each time you hash a string you get something different.
Hash the username+password using a normal hash algo. This will be the same as the key of the 'secret' table in the database, but will match nothing else in the database.
Append the hashed pass to the username and hash it with the above algorithm.
Base-16 encode var n and append it in the original hash with a delimiter character.
This will create a unique hash (will be different each time) which can be checked by the system against each column in the database. The system can be set up be allow a particular unique hash only once (say, once a year), preventing MITM attacks, and none of the user's information is passed across the wire. Unless I'm missing something, there is nothing insecure about this.