To relate one record to another in MongoDB, is it ok to use a slug? - mongodb

Let's say we have two models like this:
User:
- _id
- name
- email
Company:
- _id
- name
- slug
Now let's say I need to connect a user to the company. A user can have one company assigned. To do this, I can add a new field called companyID in the user model. But I'm not sending the _id field to the front end. All the requests that come to the API will have the slug only. There are two ways I can do this:
1) Add slug to relate the company: If I do this, I can take the slug sent from a request and directly query for the company.
2) Add the _id of the company: If I do this, I need to first use the slug to query for the company and then use the _id returned to query for the required data.
May I please know which way is the best? Is there any extra benefit when using the _id of a record for the relationship?

Agree with the 2nd approach. There are several issues to consider when deciding on which field to use as a join key (this is true of all DBs, not just Mongo):
The field must be unique. I'm not sure exactly what the 'slug' field in your schema represents, but if there is any chance this could be duplicated, then don't use it.
The field must not change. Strictly speaking, you can change a key field, but the only way to safely do so is to change it in all the child tables atomically. This is difficult to do reliably because a) you have to know which tables are using the field (maybe some other developer added another table that you're not aware of); b) if you do it one table at a time, you'll introduce race conditions; and c) if any of the updates fail, you'll have inconsistent data and corrupted parent-child links. Some SQL DBs have a cascading-update feature to solve this problem, but Mongo does not. It's a hard enough problem that you really, really don't want to change a key field if you don't have to.
The field must be indexed. Strictly speaking this isn't true, but if you're going to join on it, then you will be running a lot of queries on it, so you'll need to index it.
For these reasons, it's almost always recommended to use a key field that serves solely as a key field, with no actual information stored in it. Plenty of people have been burned using things like Social Security Numbers, driver's license numbers, etc. as key fields, either because there can be duplicates (e.g. SSNs can be duplicated if people are using fake numbers, or missing if they don't have one), or because the numbers can change (e.g. driver's licenses).
Plus, by doing so, you can format the key field to optimize for speed of unique generation and indexing. For example, if you use SSNs, you need to check the SSN against the rest of the DB to ensure it's unique. That takes time if you have millions of records. The same goes for slugs, which are text fields that have to be checked against an index. OTOH, MongoDB's default ObjectId keys work much like UUIDs: the driver builds each one from a timestamp, a value unique to the machine and process, and a counter, so it can generate a new key without first querying the DB for uniqueness (the algorithm makes a collision statistically very unlikely).
The bottom line is that there are very good reasons not to use a "real" field as your key if you can help it. Fortunately for you, MongoDB already gives you a great key field which satisfies all the above criteria: the _id field. Therefore, you should use it. Even if slug is not a "real" field and you generate it the exact same way as an _id field, why bother? Why does a record have to have 2 unique identifiers?
The second issue in your situation is that you don't expose the company's _id field to the user. Intuitively, it seems like that should be a valuable piece of information that shouldn't be given out willy-nilly. But the truth is, it has no informational value by itself, because, as stated above, a key should have no actual information. The place to implement security is in the query, ensuring that the user doing the query has permission to access the record / specific fields that she's asking for. Hiding the key is a classic case of security by obscurity, which doesn't actually improve security.
The only time to hide your primary key is if you're using a poorly thought-out key that does contain useful information. For example, an invoice Id that increments by 1 for each invoice can be used by someone to figure out how many orders you get in a day. Auto-increment Ids can also be easily guessed (if my invoice is #5, can I snoop on invoice #6?). Fortunately, Mongo's ObjectIds are much harder to guess than a simple counter, so there's very little information leaking out (mainly the creation timestamp embedded in the Id; if you're worried about more than that, you need far more in-depth security considerations than this post :-).
Look at it another way: if a slug reliably points to a specific company and user, then how is it more secure than just using the _id?
That said, there are some instances where exposing a secondary key (like slugs) is helpful, none of which have to do with security. For example, if in the future you need to migrate DB platforms and need to re-generate keys because the new platform can't use your old ones; or if users will be manually typing in identifiers, then it's helpful to give them something easier to remember like slugs. But even in those situations, you can use the slug as a handy identifier for users to use, but in your DB, you should still use the company ID to do the actual join (like in your option #2). Check out this discussion about the pros/cons of exposing _ids to users:
https://softwareengineering.stackexchange.com/questions/218306/why-not-expose-a-primary-key
So my recommendation would be to go ahead and give the user the company Id (along with the slug if you want a human-readable format e.g. for URLs, although mongo _ids can be used in a URL). They can send it back to you to get the user, and you can (after appropriate permission checks) do the join and send back the user data. If you don't want to expose the company Id, then I'd recommend your option #2, which is essentially the same thing except you're adding an additional query to first get the company Id. IMHO, that's a waste of cycles for no real improvement in security, but if there are other considerations, then it's still acceptable. And both of those options are better than using the slug as a primary key.
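As a rough illustration of option #2, here is a minimal in-memory sketch (plain Python lists stand in for the two collections; the sample data and the function names `find_company_by_slug`, `users_of_company`, and `rename_company_slug` are made up for this example, not part of any MongoDB driver). It resolves the slug to an _id first, checks permission, and only then follows the relation, so renaming a slug never touches the user records.

```python
# In-memory stand-ins for the two collections (hypothetical sample data).
companies = [
    {"_id": "c1", "name": "Acme", "slug": "acme"},
    {"_id": "c2", "name": "Globex", "slug": "globex"},
]
users = [
    {"_id": "u1", "name": "Ann", "email": "ann@example.com", "companyID": "c1"},
    {"_id": "u2", "name": "Bob", "email": "bob@example.com", "companyID": "c2"},
]


def find_company_by_slug(slug):
    """First query: translate the public slug into the company document."""
    return next((c for c in companies if c["slug"] == slug), None)


def users_of_company(slug, allowed_company_ids):
    """Second query: follow the relation by _id, after a permission check."""
    company = find_company_by_slug(slug)
    if company is None:
        return []
    # Security lives in the query path, not in hiding the key.
    if company["_id"] not in allowed_company_ids:
        raise PermissionError("caller may not read this company")
    return [u for u in users if u["companyID"] == company["_id"]]


def rename_company_slug(old_slug, new_slug):
    """A slug change touches only the company document, never the users."""
    company = find_company_by_slug(old_slug)
    if company is not None:
        company["slug"] = new_slug


print([u["name"] for u in users_of_company("acme", {"c1"})])
rename_company_slug("acme", "acme-inc")
print([u["name"] for u in users_of_company("acme-inc", {"c1"})])
```

If the relation were stored as a slug instead, `rename_company_slug` would also have to rewrite every user that points at the company, which is exactly the non-atomic update problem described above.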

The second approach is the best, that is, adding the _id of the company.
Using _id is the best practice for querying any kind of information; even complex queries can be solved using _id, as it is a unique ObjectId created by MongoDB. If you use Mongoose, it also enables population, which is the process of automatically replacing the specified paths in the document with document(s) from other collection(s). We may populate a single document, multiple documents, a plain object, multiple plain objects, or all objects returned from a query.

Related

Is it possible to group multiple collections in mongodb

So I'm working with a database that has multiple collections, and some of the data overlaps between collections. In particular, I have a collection called app-launches which contains a field called userId, and one called users where the _id of a particular object is actually the same as the userId in app-launches. Is it possible to group the two collections together so I can analyze the data? Or maybe match the userId in app-launches with the _id in users?
There is no definite answer to your question, Jeffrey, and none of the experts here can tell you which technique to choose just from this information.
After going through various web pages and the Mongo documentation, and getting to understand the design patterns used in Mongo over time, here is how I would design it. It depends on a few things, which I'll try to explain briefly.
1) If you have a One-To-One relation, always prefer Embedding over Linking, e.g. a User and its address (assuming the user has only one address). That way you get atomicity (without worrying about transactions) and can fetch the record without the extra round trips needed to bring in the other information, as is the case with Linking (e.g. with DBRef).
2) If you have a One-To-Many relation, consider whether you can do the job with Embedding (prefer this, for the benefits explained in point 1). Embedding helps if you always want the information together, e.g. Post/Comments where your requirement is to get a post and all of its comments by postId. But think of a situation where you need to get all the comments (and their related posts) that contain some specific tags. In this case you should prefer Linking, because with Embedding you would end up fetching the whole collection of comments for each post and would have to filter out the desired comments yourself.
3) For a Many-To-Many relation, I would prefer two separate entities as well as another collection for linking them, e.g. Product-Category.
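To make the Post/Comments trade-off concrete, here is a small sketch using plain Python dicts as stand-ins for documents (the field names `postId`, `tags`, and the helper `comments_with_tag` are assumptions chosen for illustration). With the linked layout, a tag search runs directly over the comments and picks up the related post by its _id.

```python
# Embedded layout: the comments live inside the post document.
post_embedded = {
    "_id": "p1",
    "title": "Hello",
    "comments": [
        {"text": "Nice", "tags": ["praise"]},
        {"text": "Typo in line 2", "tags": ["bug"]},
    ],
}

# Linked layout: comments are their own collection and reference the post.
posts = [{"_id": "p1", "title": "Hello"}, {"_id": "p2", "title": "World"}]
comments = [
    {"_id": "k1", "postId": "p1", "text": "Nice", "tags": ["praise"]},
    {"_id": "k2", "postId": "p1", "text": "Typo in line 2", "tags": ["bug"]},
    {"_id": "k3", "postId": "p2", "text": "Crash on save", "tags": ["bug"]},
]


def comments_with_tag(tag):
    """Return (comment text, post title) pairs for comments carrying `tag`."""
    posts_by_id = {p["_id"]: p for p in posts}
    return [
        (c["text"], posts_by_id[c["postId"]]["title"])
        for c in comments
        if tag in c["tags"]
    ]


print(comments_with_tag("bug"))
```

With `post_embedded`, the same question would mean loading every post with its full comment array and filtering afterwards, which is the cost point 2 warns about.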

Should I use ObjectID or uid(implemented by myself) to identify user?

I am new to MongoDB and databases.
I can either implement a function to generate a uid myself or use the built-in ObjectId.
Which is better?
You should leave ObjectID generation to the clients/drivers. This makes sure that generated IDs are unique among many things, such as time, server and process. Using the standard ObjectID also means that methods implemented by drivers (such as getTimestamp()) work.
However, if you are thinking of using your own type of ID for the _id field (ie, not the standard ObjectID type), then that makes a viable choice. For example, if you want to store information about a twitter user, then using the user's twitter ID as _id value makes perfect sense. Personally, I try to rely on the ObjectID type as little as I have to, as often collections will have a field in each document already that uniquely identifies each document.
This depends on three things:
Its source
Where and how are you using the user ID?
Personal opinion.
My personal opinion is that the ObjectId is good enough. However, getting back to the first and second points:
If this ID comes from, or is going to be used in, another database such as an SQL database, you might find an incrementing ID a good idea, although SQL and other technologies can store the ObjectId in its hexadecimal form just fine.
If this ID is something that will be used much like an account number (think of your account number for car insurance when you phone them up), you might find an ObjectId too difficult for your users to remember or recount, so a more human-friendly ID might be more appropriate here.
So it really depends on how this ID is being used.

Informative vs unique generated ID in REST API

Designing a RESTful API. I have two ways of identifying resources (person data). Either by the unique ID generated by the database, or by a social security number (SSN), entered for each person. The SSN is supposedly unique, though can be changed.
Using the ID would be most convenient for me, since it is guaranteed to be unique, and does not change. Hence the URL for the resource, also always stays the same:
GET /persons/12
{
  "name": "Morgan",
  "ssn": "840212-3312"
}
The argument for using SSN, is that it is more informative and understandable by API clients. SSN is also used more in surrounding systems:
GET /persons/840212-3312
{
  "name": "Morgan",
  "id": "12"
}
So the question is: should I go with the first approach and avoid some implementation headaches where the SSN may change, and maybe provide some helper method that converts from SSN to ID?
Or go with the second approach, providing a more informative API, though having to deal with some not-so-RESTful strangeness where URLs might change due to SSN changes?
URL design is a personal choice. But to give you some more examples which differ from those Ray has already provided, I will give you some of my own.
I have a user account resource and allow access via both URIs:
/users/12
and
/users/morgan
where the numerical value is an auto-incremented ID, and the alphabetic value is a unique username on the system specified by the user. These resources are uncacheable, so I do not bother with canonicalisation; however, the /users page links to the alphabetic forms.
No other resources on my system have two unique fields, so are referred to by IDs, /jobs/123, /quotations/456.
As you can see, I prefer plural URI segments ;-)
I think of "job 123" as being from the "jobs" collection, so it seems logical to have a "jobs" resource, with subresources for each job.
You do not need to have a separate /search/ area for performing searches, I think it would be cleaner to apply your search criteria to the collection resource directly:
/people?ssn=123456-7890 (people with SSN matching/containing "123456-7890")
/people?name=morgan (people whose name is/contains "Morgan")
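A minimal sketch of applying search criteria directly to the collection resource, assuming a hypothetical in-memory `PEOPLE` list and a made-up `search_people` helper (only `urlsplit` and `parse_qs` come from Python's standard library):

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical data standing in for the people collection.
PEOPLE = [
    {"id": "12", "name": "Morgan", "ssn": "840212-3312"},
    {"id": "13", "name": "Alex", "ssn": "790101-1234"},
]


def search_people(url):
    """Filter the collection by the query parameters present in the URL."""
    query = parse_qs(urlsplit(url).query)
    results = PEOPLE
    for field in ("name", "ssn"):
        if field in query:
            needle = query[field][0].lower()
            # "is/contains" semantics: case-insensitive substring match.
            results = [p for p in results if needle in p[field].lower()]
    return results


print(search_people("/people?name=morgan"))
print(search_people("/people"))
```

With no parameters the same URL simply returns the whole collection, so no separate /search/ area is needed.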
I have something similar, but use only the first letter as a filter:
/sites?alpha=f
This lists all sites beginning with F. You can think of it as a filter or as a search criterion; those terms are just different sides of the same coin.
Good to see someone taking time to think about their Resource urls!
I would make a Url with the unique id to provide resource to a single user. Like:
http://api.mysite.com/person/12/
Where 12 is your unique ID. Note that I also prefer the singular 'person'....
Regardless, the url should return:
{
  "ssn": "840212-3312",
  "name": "Morgan",
  "id": "12"
}
However, I would also create a general search URL that returns a list of users that match the parameters (either a json array or whatever format you need). You can specify search parameters as get params like this:
http://api.mysite.com/person/search/?ssn=840212-3312
Or
http://api.mysite.com/person/search/?name=Morgan
These would return something like this for a single search hit--note it's an array, not a single item like the unique id url that points directly to a single user.
[{
  "ssn": "840212-3312",
  "name": "Morgan",
  "id": "12"
}]
This search could then be later augmented for other search criteria. You might only return the unique id's via the search Url--you could always make a request to the unique id url once you've got it from the search...
I would suggest that you use neither. Generate resource IDs that are unique both to a single user of your API and across all other resources (including other users' resources).
Using the unique database ID is not ideal for a couple of reasons. First, API resources and database records won't necessarily always be 1-to-1 even if you have designed it that way today. Second, you might change to a different data store that would generate different format unique ids.
Also, it is good practice to separate out the ID from other resource properties, such as SSN (as an aside I hope you are storing SSN in a very secure manner, but that's another topic). If for whatever reason an SSN changed, more than one API resource was associated with the same SSN, or you decide that piece of data is not needed someday, you don't want to have to change the ID.
One pattern is to prepend the unique ID with a few characters that indicate the resource type. For example, if User is a resource type in your API, a generated unique ID would be something like USR56382.
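One possible way to generate such type-prefixed IDs, as a sketch only: the helper name `make_resource_id` and the five-digit width are assumptions modelled on the USR56382 example, and a real system would still have to check the result against its store, since a random five-digit number is not unique by itself.

```python
import secrets


def make_resource_id(prefix, digits=5):
    """Build an ID like USR56382: a type prefix plus a random number.

    Uniqueness is NOT guaranteed here; the caller is assumed to retry
    if the generated ID already exists in the data store.
    """
    number = secrets.randbelow(10 ** digits)
    return f"{prefix}{number:0{digits}d}"


user_id = make_resource_id("USR")
print(user_id)
```

Because the ID is generated by the API layer rather than by the database, it can stay stable if the underlying data store or its native key format changes.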
REST is an architectural style which emphasizes a resource-centric design approach.
In my opinion, resources should be named as plural nouns.
Every resource, for example customers, has the following uniform interface:
POST /customers - for creating a resource instance
PUT /customers/{customerId} - for updating a particular instance
GET /customers - for searching customers. So @Ray, search does not need to be part of the URI itself. Any filter or query parameters that need to be supported should go on this resource.
GET /customers/{customerId} - to retrieve a particular instance of customer
DELETE /customers/{customerId} - for deleting a particular instance
The reason for the plural is that the collection behaves as a factory. For example, when you are trying to create a new instance of a resource, the instance does not exist yet, so the request cannot be addressed to the instance itself. Hence, the singular is not used.
The same goes for search/inquiry, where you do not know or hold the actual instance of the resource. Hence, the plural form is recommended.
Now, the question is what to use for a resource id - a database primary key, a generated identifier, or an encrypted token.
In my opinion, database primary keys should not be exposed. A resource identifier should not be designed 1-to-1 with the DB primary key, but it happens a lot. A generated UUID-based key is recommended to avoid attacks that walk through sequential IDs, but the world is not always ideal.
Coming back to tokens: an encrypted token is a recommended approach for sensitive APIs, and where data is exchanged between two separate applications. If we use it, the encryption/decryption should happen solely at the API end. That means the encrypted keys for sub-resources should be returned as part of the parent API response; otherwise it defeats the purpose.

Ensure data coherence across documents in MongoDB

I'm working on a Rails app that implements some social network features such as relationships, following, etc. Everything was fine until I came across a problem with many-to-many relations. As you know, Mongo lacks joins, so the recommended workaround is to store the relation as an array of ids on both related documents. OK, it's a bit redundant, but it should work. Let's say:
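field :followers, type: Array, default: []
field :following, type: Array, default: []

# self starts following `who`; both documents have to be updated.
def follow!(who)
  who.followers << self.id   # `who` gains self as a follower
  self.following << who.id   # self now follows `who`
  self.save
  who.save
end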
That works pretty well, but this is one of those cases where we would need a transaction, uh, but mongo doesn't support transactions. What if the id is added to the 'followed' followers list but not to the 'follower' following list? I mean, if the first document is modified properly but the second for some reason can't be updated.
Maybe I'm too pessimistic, but there isn't a better solution?
I would recommend storing relationships in one direction only, storing the users someone follows in their user document as "following". Then if you need to query for all followers of user U1, you can query the users collection with db.users.find({following: "U1"}). Since you can have a multi-key index on an array, this query will be fast if you index this field.
The other reason to go in that direction only is a single user has a practical limit to how many different users they may be following. But the number of followers that a really popular user may have could be close to the total number of users in your system. You want to avoid creating an array in a document that could be that large.
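The one-direction layout can be sketched with an in-memory list standing in for the users collection (the sample users and the helpers `follow` and `followers_of` are made up for illustration). A follow becomes a single-document update, and followers are derived by a query instead of being stored.

```python
# Each user document only records whom that user follows.
users = [
    {"_id": "U1", "following": []},
    {"_id": "U2", "following": ["U1"]},
    {"_id": "U3", "following": ["U1", "U2"]},
]


def follow(follower_id, followed_id):
    """Only the follower's document changes, so there is nothing to keep in sync."""
    follower = next(u for u in users if u["_id"] == follower_id)
    if followed_id not in follower["following"]:
        follower["following"].append(followed_id)


def followers_of(user_id):
    """In-memory equivalent of querying users where following contains user_id."""
    return [u["_id"] for u in users if user_id in u["following"]]


follow("U1", "U3")
print(followers_of("U1"))
print(followers_of("U3"))
```

Since each follow is one write to one document, the half-updated state described in the question cannot occur.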

Is it ok to turn the mongo ObjectId into a string and use it for URLs?

document/show?id=4cf8ce8a8aad6957ff00005b
Generally I think you should be cautious about exposing internals (such as DB ids) to the client. The URL can easily be manipulated, and the user may then gain access to objects you don't want them to have.
For MongoDB in particular, the object ID might even reveal some additional internals (see here), i.e. the IDs aren't completely random. That might be an issue too.
Besides that, I think there's no reason not to use the id.
I generally agree with @MartinStettner's reply. I wanted to add a few points, mostly elaborating on what he said. Yes, a small amount of information is decodable from the ObjectId. This is trivially accessible if someone recognizes this as a MongoDB ObjectId. The two downsides are:
It might allow someone to guess a different valid ObjectId, and request that object.
It might reveal info about the record (such as its creation date) or the server that you didn't want someone to have.
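To show how little effort the second point takes, here is a short sketch that reads the creation time out of the ObjectId from the question. It relies on the documented ObjectId layout, where the first 4 bytes are a Unix timestamp in seconds; the function name `objectid_created_at` is just for this example.

```python
from datetime import datetime, timezone


def objectid_created_at(object_id_hex):
    """Decode the creation time from the first 4 bytes (8 hex chars)."""
    seconds = int(object_id_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


created = objectid_created_at("4cf8ce8a8aad6957ff00005b")
print(created.isoformat())
```

For the ID in the question this yields a date in December 2010, with no access to the database needed.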
The "right" fix for the first item is to implement some sort of real access control: 1) a user has to login with a username and password, 2) the object is associated with that username, 3) the app only serves objects to a user that are associated with that username.
MongoDB doesn't do that itself; you'll have to rely on other means. Perhaps your web-app framework, and/or some ad-hoc access control list (which itself could be in MongoDB).
But here is a "quick fix" that mostly solves both problems: create some other "id" for the record, based on a large, high-quality random number.
How large does "large" need to be? A 128-bit random number has 3.4 * 10^38 possible values. So if you have 10,000,000 objects in your database, someone guessing a valid value is a vanishingly small probability: 1 in 3.4 * 10^31. Not good enough? Use a 256-bit random number... or higher!
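The arithmetic above, together with one way to produce such a number, can be sketched with Python's standard `secrets` module (the helper name `new_public_id` is an assumption for this example):

```python
import secrets


def new_public_id():
    """Return a 128-bit random ID as 32 hex characters."""
    return secrets.token_hex(16)  # 16 bytes = 128 bits


possible_values = 2 ** 128
objects_in_db = 10_000_000
# Chance of hitting any valid ID with a single guess is 1 in `odds`.
odds = possible_values / objects_in_db

print(new_public_id())
print(f"{possible_values:.1e}")  # 3.4e+38
print(f"1 in {odds:.1e}")        # 1 in 3.4e+31
```

Switching to `secrets.token_hex(32)` gives the 256-bit variant mentioned above.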
How to represent this number in the document? You could use a string (encoding the number as hex or base64), or MongoDB's binary type. (Consult your driver's API docs to figure out how to create a binary object as part of a document.)
While you could add a new field to your document to hold this, you'd probably also want an index on it. So the document size is bigger, and you spend more memory on that index. Here's what you might not have thought of: simply USE that "truly random id" as your document's "_id" field. Thus the per-document size is only a little higher, and you use the index that you [probably] had there anyway.
I can set both the 128-character session string and the object ids of the other collection documents as cookies, and when the user visits, do an asynchronous fetch of the session, user, and account all at once, instead of fetching the session first and then the user and account afterwards. If the session document is valid, I'll share the user and account documents.
If I do this, I'll have to make every single request for a user or account document also require the 128-character session cookie, which makes exposing the user and account object ids safer. It means that anyone guessing a user ID or account ID also has to guess the 128-character string to get any answers from the system.
Another security measure is to wrap the id in some salt whose positioning only you know, such as
XXX4cf8ce8XXXXa8aad6957fXXXXXXXf00005bXXXX
Now you know exactly how to slice that up to get the ID.
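Slicing the example back apart could look like the sketch below. The slice positions in `ID_SLICES` are read off the wrapped string above, and `unwrap` is a made-up helper name. Note that this only obscures the id; anyone who learns the positions can undo it, so it does not replace the access control discussed earlier.

```python
WRAPPED = "XXX4cf8ce8XXXXa8aad6957fXXXXXXXf00005bXXXX"

# (start, end) positions of the real id fragments inside the wrapped string.
ID_SLICES = [(3, 10), (14, 24), (31, 38)]


def unwrap(wrapped):
    """Join the id fragments back together, dropping the padding."""
    return "".join(wrapped[start:end] for start, end in ID_SLICES)


print(unwrap(WRAPPED))
```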