Referencing Other Documents by String rather than ObjectId - mongodb

Let's say I have two collections:
Products and Categories.
The former collection's documents have 2 fields:
_id (BSON ObjectId)
Name (String)
The latter collection's documents have 3 fields:
_id (BSON ObjectId)
Name (String)
Products (Array of Strings)
Assume I have the following Product document:
{ "_id" : ObjectId("AAA"), "name" : "Shovel" }
Let's say I have the following Category document:
{ "_id" : ObjectId("BBB"), "Name" : "Gardening", "Products" : ["AAA"] }
For purposes of this example, assume that AAA and BBB are legitimate ObjectId's - example: ObjectId("523c7df5c30cc960b235ddee") where they would equal the inner ObjectId's string.
Should the Products field be stored as ObjectId(...)'s rather than as Strings?

I don't think it really matters that much.
I'm pretty sure that an ObjectId is stored as a binary value and only displayed as a hex string, so the native form is probably slightly more efficient with memory and bandwidth. I have done it both ways. As long as you decide, for each field, how you are going to encode it and stick to that, either will work just fine.

As long as you consistently use the same type (so that comparisons happen correctly), the difference is:
An ObjectId cannot be compared to a String representation of the same ObjectId value. Thus, ObjectId("523c7df5c30cc960b235ddee") is not equal to "523c7df5c30cc960b235ddee".
ObjectIds, when stored natively, will be stored as 12 bytes, plus field name
An ObjectId, when stored as a string, will commonly take 24 bytes of character data (its 24 hexadecimal digits) plus a few bytes of BSON string overhead, plus field name
Comparisons can be made SLIGHTLY more efficiently with the 12-byte value, as it's comparing fewer bytes. It won't matter in most types of usage though, so it's a micro-optimization (but something you should know)
Bonus: if you don't use short, abbreviated field names, the size benefit of storing an ObjectId natively as 12 bytes really won't matter, as the field names will far outweigh the bytes saved compared with storing it as a string.
I'd recommend storing them as native ObjectIds. Some drivers can optionally and transparently translate an ObjectId to a String and back so that the client code can more easily manipulate it. The C# driver for example can do this, and I've used it so that when serializing to JSON, the ObjectId is in a simple format that is easily consumed in JavaScript.
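To make the size and comparison points above concrete, here is a minimal Python sketch that uses only the standard library. It models a native ObjectId as its 12 raw bytes instead of using a real driver type; that stand-in is an assumption made to keep the example self-contained.

```python
# Hex form of the example ObjectId from the question.
hex_id = "523c7df5c30cc960b235ddee"

# Stand-in for the native form: the 12 raw bytes the hex string encodes.
# (Assumption: a real driver would use its own ObjectId type, not bytes.)
raw_id = bytes.fromhex(hex_id)

raw_len = len(raw_id)                  # payload bytes in native form
hex_len = len(hex_id.encode("utf-8"))  # payload bytes in string form

# Values of different types do not compare equal,
# even though they encode the same id.
same_across_types = (raw_id == hex_id)

# The conversion is lossless, so either form can be derived from the other.
round_trip = raw_id.hex()
```

The practical consequence is the one stated above: pick one representation per field, because a query for the string form will not match documents that store the native form.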

This will matter most when you try to find the details of a product starting from the Categories collection.
Since there is no server-side JOIN in Mongo, your code will have to match documents together. ObjectIDs are encoded as 12 bytes, which you can easily compare in any language. Using either strings or object ids does not really matter.
The real issue you are facing is one of data normalization (or lack thereof). If you store the product's Name in your Categories documents, instead of its ObjectID, you will be able to return the product names in a single call (instead of multiple calls, one for each product of the category).
It feels wrong the first time you do it. After all, you will have to update many documents if you ever change the name of a product, which might or might not be frequent. You have to model your data by thinking of the way your application will use it.
Finally, index the Name attribute in the Products collection. Getting the details of a product, starting with the string you found in a Categories document, will then be fast.
Another way to do it is not to have a Categories collection at all, but to add a Category attribute to your Products documents. You can then find documents that match {'Category':'Gardening'}. Indexing the Category field will probably be a good idea.
Again, ObjectID or String does not matter much. It is about modeling your data thinking of how your application will use it.
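The three modeling options discussed in this answer can be sketched in plain Python. The dictionaries below are in-memory stand-ins for collections, not a real MongoDB connection, and the extra products ("Rake", "Hammer"), the ids "CCC" and "DDD", and both helper functions are hypothetical names added for illustration.

```python
# Stand-in for the Products collection, keyed by _id.
products = {
    "AAA": {"_id": "AAA", "name": "Shovel"},
    "CCC": {"_id": "CCC", "name": "Rake"},
}

# Option 1: the category references products by id.
category_by_ref = {"_id": "BBB", "Name": "Gardening", "Products": ["AAA", "CCC"]}

def product_names_by_ref(category, products_by_id):
    """Application-side join: resolve each referenced id to a product name."""
    return [products_by_id[pid]["name"] for pid in category["Products"]]

# Option 2: denormalized, the category stores product names directly,
# so listing them needs no second lookup.
category_by_name = {"_id": "BBB", "Name": "Gardening", "Products": ["Shovel", "Rake"]}

# Option 3: no Categories collection; each product carries its category.
products_with_category = [
    {"_id": "AAA", "name": "Shovel", "Category": "Gardening"},
    {"_id": "CCC", "name": "Rake", "Category": "Gardening"},
    {"_id": "DDD", "name": "Hammer", "Category": "Tools"},
]

def names_in_category(docs, category):
    """In-memory equivalent of filtering on {'Category': category}."""
    return [d["name"] for d in docs if d["Category"] == category]

joined = product_names_by_ref(category_by_ref, products)
embedded = category_by_name["Products"]
filtered = names_in_category(products_with_category, "Gardening")
```

All three return the same names; they differ in how many lookups are needed and in how many documents must change when a product is renamed.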

Related

Can mongo document ID format be customised?

I would like to encode some meaning in the first N characters of every document ID, i.e. make the first three characters determine a document type that is meaningful to the system the documents are used in.
You can supply a custom _id when you insert the document. If the document to be inserted doesn't contain an _id, then MongoDB will generate an ObjectId for you.
The _id can be of any type, but keeping it uniform across all documents makes sense if you are accessing them from the application layer.
You can refer to one of the old questions on SO - How to generate unique object id in mongodb
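As a rough illustration of such a custom _id, the following Python sketch builds a string id whose first three characters carry the document type. The helper name make_prefixed_id, the "usr" prefix, and the choice of a random UUID for the unique part are all assumptions made for this example, not something prescribed by MongoDB.

```python
import uuid

def make_prefixed_id(doc_type: str) -> str:
    """Build a string _id whose first three characters encode the document type."""
    if len(doc_type) != 3:
        raise ValueError("doc_type must be exactly three characters")
    # uuid4 supplies the uniqueness; the prefix supplies the meaning.
    return doc_type + uuid.uuid4().hex

# The document would then be inserted with this _id already set,
# so the server has no reason to generate an ObjectId for it.
doc = {"_id": make_prefixed_id("usr"), "name": "example"}
```

Because the _id is supplied by the application, keeping it unique is the application's responsibility.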

MongoDB schema diagram

Would diagramming a MongoDB schema as a class diagram (UML format) be feasible, given that ER diagrams relate more to SQL?
When representing the id in a high-level schema, which of the following 3 has the correct type (int, objectId or _id)?
id: int
OR
id: bson.ObjectId
OR
id: _id
When representing a subdocument object in a schema diagram, which of the following 2 has the correct type (String or Object)
comments : String [
{
userName : String
date : String
actualComment : String
}
]
OR
comments : Object [
{
userName : String
date : String
actualComment : String
}
]
UPDATE
If I have the following subdocument (here is JSON representation), how does Mongo store the replies - what type would it be?
comments : String [
{
userName : String
date : String
actualComment : String
replies : comment [ ] // how does mongo store nested replies
}
]
A UML class diagram is for classes in object-oriented programming and an ER diagram is for relations in a relational database. MongoDB is neither an object database nor a relational database, so neither tool is really a good fit for MongoDB. But given only those two tools, I would rather use UML class diagrams, because ER emphasizes something which should best be avoided in MongoDB: relations between documents.
By default, the _id field is filled with a generated value of type BSON ObjectId, so your second example, bson.ObjectId, would be technically correct if you use the default. However, you don't have to use the default. You can also explicitly set your _id fields to a value of your own, of any type you want. So if you want to use integers for your _id values for some reason (remember that you then need to take care of keeping them unique), you can of course do so and should say so in your documentation. When you don't use custom values for your _id fields and don't otherwise make any use of them, you might consider just omitting them from your diagrams, because they are implied.
In my opinion, embedded documents are best expressed in UML class diagrams by using composition (black-diamond arrow), while referenced documents are expressed using aggregation (white-diamond arrow). A sub-document is definitely not a String. Object is better, but even better would be to use the correct type.
Regarding your follow-up question: infinitely nested data structures (comments with an array of comments with an array of comments...) can be visualized in UML through a composition arrow pointing back at the box it comes from. But keep in mind that such data structures are a bad fit for MongoDB and usually best avoided. I would rather recommend putting each comment into its own document, which references the topic it belongs to and the parent comment (aggregation). But even that's not a particularly elegant solution. MongoDB isn't built for storing graphs.
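The "one document per comment" layout suggested above can be sketched in Python with plain dictionaries standing in for documents. The field names topic_id and parent_id, the sample data, and the build_thread helper are hypothetical; the sketch only shows how the application could rebuild the nesting from flat documents.

```python
from collections import defaultdict

# One document per comment, each referencing its topic and its parent comment.
comments = [
    {"_id": 1, "topic_id": "t1", "parent_id": None, "userName": "ann", "actualComment": "First"},
    {"_id": 2, "topic_id": "t1", "parent_id": 1, "userName": "bob", "actualComment": "Reply to 1"},
    {"_id": 3, "topic_id": "t1", "parent_id": 2, "userName": "ann", "actualComment": "Reply to 2"},
    {"_id": 4, "topic_id": "t1", "parent_id": None, "userName": "cy", "actualComment": "Second"},
]

def build_thread(flat_comments):
    """Group comments by parent, then nest replies under each top-level comment."""
    children = defaultdict(list)
    for comment in flat_comments:
        children[comment["parent_id"]].append(comment)

    def attach(comment):
        # Copy the comment and add its (recursively nested) replies.
        replies = [attach(reply) for reply in children[comment["_id"]]]
        return {**comment, "replies": replies}

    return [attach(comment) for comment in children[None]]

thread = build_thread(comments)
```

The stored documents stay flat and bounded in size; the nesting only exists in application memory after all comments of a topic have been fetched.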
Feasible, but not suitable in all cases. FK relationships can be represented the same way. For arrays, embedded, etc. you'd have to establish a representation/interpretation.
ObjectID is the type; that's a BSON type. _id is the field name. No idea where int came from; the BSON integer types are 32-bit integer and 64-bit integer.
None of them. It's simply a (sub)document.
UPDATE
It's an array technically. No specific type. In that case you probably were thinking of an array of ids of comment entities, but could be anything you want I think (including subdocuments).

mgo - bson.ObjectId vs string id

Using mgo, it seems that best practice is to set object ids to be bson.ObjectId.
This is not very convenient, as the result is that instead of a plain string id the id is stored as binary in the DB. Googling this seems to yield tons of questions like "how do I get a string out of the bson id?", and indeed in golang there is the Hex() method of the ObjectId to allow you to get the string.
The bson becomes even more annoying to work with when exporting data from mongo to another DB platform (this is the case when dealing with big data that is collected and you want to merge it with some properties from the back office mongo DB), this means a lot of pain (you need to transform the binary ObjectId to a string in order to join with the id in different platforms that do not use bson representation).
My question is: what are the benefits of using bson.ObjectId vs string id? Will I lose anything significant if I store my mongo entities with a plain string id?
As was already mentioned in the comments, storing the ObjectId as a hex string would double the space needed for it and in case you want to extract one of its values, you'd first need to construct an ObjectId from that string.
But you have a misconception. There is absolutely no need to use an ObjectId for the mandatory _id field. Quite often, I advise against that. Here is why.
Take the simple example of a book, with relations and some other considerations set aside for simplicity:
{
_id: ObjectId("56b0d36c23da2af0363abe37"),
isbn: "978-3453056657",
title: "Neuromancer",
author: "William Gibson",
language: "German"
}
Now, what use would the ObjectId have here? Actually none. It would be an index with hardly any use, since you would never search your book database by an artificial key like that. It holds no semantic value. It would be a unique ID for an object which already has a globally unique ID: the ISBN.
So we simplify our book document like this:
{
_id: "978-3453056657",
title: "Neuromancer",
author: "William Gibson",
language: "German"
}
We have reduced the size of the document, made use of a preexisting globally unique ID and no longer have a basically unused index.
Back to your basic question of whether you lose something by not using ObjectIds: quite often, not using the ObjectId is the better choice. But if you use it, use the binary form.
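One value that can be extracted from an ObjectId is its creation time, which occupies the first 4 of its 12 bytes as big-endian seconds since the Unix epoch. The Python sketch below does this by hand from a hex string using only the standard library; drivers typically offer this on their own ObjectId type, and the function name objectid_hex_to_datetime is a hypothetical one chosen for this example.

```python
from datetime import datetime, timezone

def objectid_hex_to_datetime(hex_id: str) -> datetime:
    """Read the creation time stored in the first 4 bytes of an ObjectId."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes (24 hex characters)")
    # First 4 bytes: big-endian seconds since the Unix epoch.
    seconds = int.from_bytes(raw[:4], "big")
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# The ObjectId from the book example above.
created = objectid_hex_to_datetime("56b0d36c23da2af0363abe37")
```

With an id stored as a hex string, this decoding step (or constructing a driver ObjectId from the string) is needed every time; with the binary form the value is already in the right shape.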

Does replacing the "_id" field in mongodb with a custom unique key decrease performance?

I have a situation in which I have a User schema that contains a unique field called "username." At the same time, mongo automatically creates its own unique key, "_id."
I've noticed that for a lot of my schemas I need both an array of "usernames" as well as "ids". This is quite redundant sometimes so my question is:
Is a lookup via "_id" faster than a lookup for a field "username" (let's say a 10 character string)? If they are the same, is it viable to use my unique identifier username for the value of _id?
If your data naturally has a required, unique field, then it's perfectly fine to use that value as your _id.
As long as the field's data is comparable in size to an ObjectId (which is 12 bytes), then performance should be the same. A 10-character ASCII string takes about 15 bytes in BSON (10 bytes of UTF-8 plus a length prefix and terminator), so the index for username will take a bit more memory, but probably not enough to make a difference performance-wise.
Since you're using Mongoose, you could also create a virtual field (named username) that exposes the _id field using that more descriptive name, as well.
I think this is fine, UNLESS you will be changing the structure of your usernames in the future. Thus I think it's better to just stick with ObjectId() for the ID and then stick an extra field username if you need it.

Change size of Objectid

In MongoDb ObjectId is a 12-byte BSON type.
Is there any way to reduce the size of objectID?
No. It's a BSON data type. It's like asking a 32-bit integer to shrink itself.
Every object must have _id property, but you are not restricted to ObjectId.
Every document in a MongoDB collection needs to have a unique _id but the value does not have to be an ObjectId. Therefore, if you are looking to reduce the size of documents in your collection you have two choices:
Pick one of the unique properties of your documents and use it as the _id field. For example, if you have an accounts collection where the account ID--provided externally--is part of your data model, you could store the account ID in the _id field.
Manage primary keys for the collection yourself. Many drivers support custom primary key factories. As @assylias suggests, going with an int will give you good space savings but, still, you will use more space than if you can use one of the fields in your model as the _id.
BTW, the value of an _id field can be composite: you can use an Object/hash/map/dictionary. See, for example, this SO question.
If you are using some type of object/model framework on top of Mongo, I'd be careful with (1). Some frameworks have a hard time with developers overriding id generation. For example, I've had bad experience with Mongoid in Ruby. In that case, (2) may be the safer way to go as the generation happens at the driver layer.
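To compare the space taken by different _id candidates, here is a rough Python sketch based on the BSON layout of each value type: 4 bytes for a 32-bit integer, 8 for a 64-bit integer, 12 for an ObjectId, and a 4-byte length prefix plus UTF-8 bytes plus a terminating zero byte for a string. The helper bson_value_size is hypothetical, it uses a 12-byte bytes object as a stand-in for an ObjectId, and it assumes integers are stored as 32-bit whenever they fit, which is a driver-dependent choice.

```python
def bson_value_size(value) -> int:
    """Estimate the bytes a candidate _id value occupies in BSON.

    Counts the value only, excluding the type byte and the field name.
    """
    if isinstance(value, bool):
        raise TypeError("booleans are not considered here")
    if isinstance(value, bytes) and len(value) == 12:
        return 12  # stand-in for a native ObjectId
    if isinstance(value, int):
        # Assumption: stored as int32 when it fits, otherwise int64.
        return 4 if -2**31 <= value < 2**31 else 8
    if isinstance(value, str):
        # int32 length prefix + UTF-8 bytes + trailing zero byte
        return 4 + len(value.encode("utf-8")) + 1
    raise TypeError(f"unsupported type: {type(value).__name__}")

sizes = {
    "int32 key": bson_value_size(42),
    "int64 key": bson_value_size(2**40),
    "ObjectId": bson_value_size(bytes(12)),
    "10-char string": bson_value_size("a" * 10),
    "ObjectId as hex string": bson_value_size("523c7df5c30cc960b235ddee"),
}
```

These figures cover the _id value itself; the same value is also stored in the _id index, so the difference is paid twice per document.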