mgo - bson.ObjectId vs string id - mongodb

Using mgo, it seems that best practice is to set object ids to be bson.ObjectId.
This is not very convenient, as the result is that instead of a plain string id the id is stored as binary in the DB. Googling this seems to yield tons of questions like "how do I get a string out of the bson id?", and indeed in golang there is the Hex() method of the ObjectId to allow you to get the string.
The bson id becomes even more annoying when exporting data from mongo to another DB platform. This comes up when you collect big data and want to merge it with some properties from the back office mongo DB: you have to transform the binary ObjectId to a string before you can join on the id in platforms that do not use the bson representation.
My question is: what are the benefits of using bson.ObjectId vs string id? Will I lose anything significant if I store my mongo entities with a plain string id?

As was already mentioned in the comments, storing the ObjectId as a hex string would double the space needed for it and in case you want to extract one of its values, you'd first need to construct an ObjectId from that string.
But you have a misconception. There is absolutely no need to use an ObjectId for the mandatory _id field. Quite often, I advise against that. Here is why.
Take the simple example of a book, relations and some other considerations set aside for simplicity:
{
_id: ObjectId("56b0d36c23da2af0363abe37"),
isbn: "978-3453056657",
title: "Neuromancer",
author: "William Gibson",
language: "German"
}
Now, what use would the ObjectId have here? Actually none. It would be an index with hardly any use, since you would never search your book database by an artificial key like that. It holds no semantic value. It would be a unique ID for an object which already has a globally unique ID: the ISBN.
So we simplify our book document like this:
{
_id: "978-3453056657",
title: "Neuromancer",
author: "William Gibson",
language: "German"
}
We have reduced the size of the document, made use of a preexisting globally unique ID and no longer have a basically unused index.
Back to your basic question of whether you lose something by not using ObjectIds: quite often, not using the ObjectId is the better choice. But if you use it, use the binary form.
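To make the size argument concrete, here is a minimal sketch using only the Python standard library (not mgo or a BSON library, so the variable names are just illustrative). It takes the example id from the book document, shows the 12-byte binary form next to the 24-character hex form, and reads the creation timestamp that an ObjectId carries in its first 4 bytes.

```python
from datetime import datetime, timezone

# The example id from the book document above, in its hex string form.
hex_id = "56b0d36c23da2af0363abe37"

# The binary form is what an ObjectId actually stores: 12 bytes.
raw = bytes.fromhex(hex_id)
print(len(raw), len(hex_id))

# The first 4 bytes of an ObjectId are a big-endian Unix timestamp in seconds.
created_seconds = int.from_bytes(raw[:4], "big")
created = datetime.fromtimestamp(created_seconds, tz=timezone.utc)
print(created.isoformat())
```

With a plain string id you would have to do this hex decoding yourself every time you want one of the embedded values.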

Related

MongoDB ObjectId vs string in find()

I'm starting to play with mongodb, and I learned that when inserting a document, you can either provide an ID, or let mongodb generate it for you.
I thought this is nice, because I want to let my users optionally choose an id, and if not generate it for them.
But the problem is that the generated one is of type ObjectId while the user-provided one is a string, and the find method only returns the correct answer if you pass it the correct type. So when a user requests GET /widget/123, I have no idea whether the original ID was stored as an ObjectId or a string, do I?
So how am I supposed to use this feature?
First off, I'd recommend against letting users provide _ids: if 2 users want to use the same _id, the second user will be unable to, which will be frustrating. If users want that functionality, I'd recommend storing the user created id on a separate field & querying by user (or company or whatever) and the user-created id.
That said, mongo ObjectIds are 24 hex characters, so you can safely identify when an id is not a MongoId by checking whether it doesn't match /^[a-f0-9]{24}$/ (or by seeing whether a call to ObjectId("maybeAnObjectId") throws). In the case where it's unclear (where a user might have provided 24 hex characters as their id), you'll need to use $in (or $or) to query for both cases:
const query = /^[a-f0-9]{24}$/.test(id) ? { _id: {$in: [ObjectId(id), id]}} : {_id: id}
(an annoying user could re-use an autogenerated ObjectId as their string id, and then queries to that route would return two values and there'd be no way of differentiating them).
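For anyone doing this from Python, the same check could be sketched roughly as below. The function name build_id_filter and the make_object_id parameter are my own placeholders, not part of any driver; in a real application you would presumably pass in the ObjectId class from your BSON library.

```python
import re

# Same pattern as in the JavaScript one-liner: exactly 24 lowercase hex characters.
OBJECT_ID_RE = re.compile(r"[a-f0-9]{24}")


def build_id_filter(raw_id, make_object_id):
    """Build a query filter for an id that may be stored as a string or an ObjectId.

    make_object_id is a hypothetical hook: a callable that turns the hex
    string into the driver's ObjectId type.
    """
    if OBJECT_ID_RE.fullmatch(raw_id):
        # Ambiguous case: match either the ObjectId or the literal string.
        return {"_id": {"$in": [make_object_id(raw_id), raw_id]}}
    # Cannot be an ObjectId, so it must have been stored as a plain string.
    return {"_id": raw_id}
```

The resulting dictionary is what you would hand to find(); the ambiguity described above (two documents matching) still applies.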

MongoDB schema diagram

Would diagramming a mongodb schema in a class diagram (UML format) be feasible, since ER diagrams relate more to SQL?
When representing the id in a high level schema, which of the following 3 has the correct type: (int or objectId or _id)
id: int
OR
id: bson.ObjectId
OR
id: _id
When representing a subdocument object in a schema diagram, which of the following 2 has the correct type (String or Object)
comments : String [
{
userName : String
date : String
actualComment : String
}
]
OR
comments : Object [
{
userName : String
date : String
actualComment : String
}
]
UPDATE
If I have the following subdocument (here is JSON representation), how does Mongo store the replies - what type would it be?
comments : String [
{
userName : String
date : String
actualComment : String
replies : comment [ ] // how does mongo store nested replies
}
]
A UML class diagram is for classes in object-oriented programming and an ER diagram is for relations in a relational database. MongoDB is neither an object database nor a relational database, so neither tool is really a good fit for MongoDB. But given only those two tools, I would rather use UML class diagrams, because ER emphasizes something which should best be avoided in MongoDB: relations between documents.
By default, the _id field is filled with a generated value of type BSON ObjectId, so your second example, bson.ObjectId, would be technically correct if you use the default. However, you don't have to use the default. You can also explicitly set your _id fields to a value of your own, of any type you want. So if you want to use integers for your _id's for some reason (remember that you then need to take care of keeping them unique), you can of course do so and should say so in your documentation. When you don't use custom values for your _id's and also don't make any other use of them, you might consider just omitting them from your diagrams, because they are implied.
In my opinion, embedded documents are best expressed in UML class diagrams by using composition (black-diamond arrow), while referenced documents are expressed using aggregation (white-diamond arrow). A sub-document is definitely not a String. Object is better, but even better would be to use the correct type.
Regarding your follow-up question: infinitely nested data structures (comments with an array of comments with an array of comments...) can be visualized in UML through a composition arrow pointing back at the box it comes from. But keep in mind that such data structures are a bad fit for MongoDB and usually best avoided. I would rather recommend putting each comment into its own document which references the topic it belongs to and the parent comment (aggregation). But even that's not a particularly elegant solution. MongoDB isn't built for storing graphs.
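A rough sketch of the "one document per comment" layout, in plain Python with no database involved. The field names topic_id and parent_id and the sample comments are made up for illustration. The flat documents each point at their parent, and the application rebuilds the nested thread after fetching them.

```python
from collections import defaultdict

# Hypothetical flat comment documents, one per comment.
comments = [
    {"_id": 1, "topic_id": "t1", "parent_id": None, "text": "First"},
    {"_id": 2, "topic_id": "t1", "parent_id": 1, "text": "Reply to first"},
    {"_id": 3, "topic_id": "t1", "parent_id": 2, "text": "Reply to reply"},
    {"_id": 4, "topic_id": "t1", "parent_id": None, "text": "Second"},
]


def build_thread(docs):
    """Rebuild the nested comment thread from flat documents."""
    # Group comments by the parent they reference.
    children = defaultdict(list)
    for doc in docs:
        children[doc["parent_id"]].append(doc)

    def attach(parent_id):
        # Recursively collect the replies of one comment (None means top level).
        return [
            {"text": c["text"], "replies": attach(c["_id"])}
            for c in children[parent_id]
        ]

    return attach(None)


thread = build_thread(comments)
```

The nesting then lives only in application code, which is exactly the part that makes this less elegant than it would be in a database built for graphs.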
Feasible, but not suitable in all cases. FK relationships can be represented the same way. For arrays, embedded, etc. you'd have to establish a representation/interpretation.
ObjectID is the type; that's a BSON type. _id is the field name. No idea how you arrived at int; the BSON integer types are 32-bit integer and 64-bit integer.
None of them. It's simply a (sub)document.
UPDATE
It's an array technically. No specific type. In that case you probably were thinking of an array of ids of comment entities, but could be anything you want I think (including subdocuments).

Configure pymongo to use string _id instead of ObjectId

I'm using pymongo to seed a database with old information from a different system, and I have a lot of queries like this:
studentId = studentsRemote.insert({'price': price})
In the actual python script, that studentId prints as a string, but in the javascript Meteor application I'm using this data in, it shows up everywhere as ObjectId(...).
I want to configure pymongo to generate the _id as a string and not bother with ObjectId's
Any objects I create with the Meteor specification will use the string format, and not the ObjectId format. I don't want to have mixing of id types in my application, because it's causing me interoperability headaches.
I'm aware I can create ObjectId's from Meteor but frankly I'd much rather use the string format. It's the Meteor default, it's much simpler, and I can't find any good reason to use ObjectId's in my particular app.
The valueOf() mongo function or something similar could parse the _id and be used to update the document once it's in the database, but it would be nice to have something more direct.
in .py files:
from bson.objectid import ObjectId
......
kvdict['_id'] = str(ObjectId())
......
mongoCollection.insert(kvdict)
it's ok!
It ended up being fairly simple.
The son_manipulator module can be used to change incoming documents to a different form. Most of the time this is used to encode custom objects, but it worked for this as well.
With the manipulator in place, it was just a matter of calling the str() function on the ObjectId to make the transformation.
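from pymongo.son_manipulator import SONManipulator

# Note: SONManipulator was deprecated in PyMongo 3.x and removed in PyMongo 4.
class ObjectIdManipulator(SONManipulator):
    def transform_incoming(self, son, collection):
        # Replace the ObjectId with its string form before the document is stored.
        son[u'_id'] = str(son[u'_id'])
        return son

db.add_son_manipulator(ObjectIdManipulator())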
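A different route, sketched here with the standard library only, is to set a string _id yourself before inserting, so the driver never has to generate an ObjectId. Note this is not the manipulator approach above: the helper name with_string_id is hypothetical, and it uses a random UUID hex string rather than an ObjectId-derived value or Meteor's own id format.

```python
import uuid


def with_string_id(doc):
    """Return a copy of doc whose _id is a string.

    An existing _id is converted with str(); a missing one is filled with
    a random 32-character hex string (an assumption, not Meteor's format).
    """
    copy = dict(doc)
    if "_id" in copy:
        copy["_id"] = str(copy["_id"])
    else:
        copy["_id"] = uuid.uuid4().hex
    return copy
```

You would then pass with_string_id({'price': price}) to the insert call instead of the bare dictionary.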

Referencing Other Documents by String rather than ObjectId

Let's say I have two collections:
Products and Categories.
The former collection's documents have 2 fields:
_id (BSON ObjectId)
Name (String)
The latter collection's documents have 3 fields:
_id (BSON ObjectId)
Name (String)
Products (Array of Strings)
Assume I have the following Product document:
{ "_id" : ObjectId("AAA"), "name" : "Shovel" }
Let's say I have the following Category document:
{ "_id" : ObjectId("BBB"), "Name" : "Gardening", "Products" : ["AAA"] }
For purposes of this example, assume that AAA and BBB are legitimate ObjectId's - example: ObjectId("523c7df5c30cc960b235ddee") where they would equal the inner ObjectId's string.
Should the Products field be stored as ObjectId(...)'s rather than as Strings?
I don't think it really matters that much.
I'm pretty sure that the ObjectId format encodes a hex number, so it is probably slightly more efficient with memory and bandwidth. I have done it both ways. As long as you decide, for each field, how you are going to encode it, either will work just fine.
As long as you consistently use the same type (so that comparisons happen correctly), the difference is:
An ObjectId cannot be compared to a String representation of the same ObjectId value. Thus, ObjectId("523c7df5c30cc960b235ddee") is not equal to "523c7df5c30cc960b235ddee".
ObjectIds, when stored natively, will be stored as 12 bytes, plus field name.
An ObjectId, when stored as a string, will commonly be stored in 24 bytes (as it will be converted to a hexadecimal string), plus field name.
Comparisons can be made slightly more efficiently with the 12-byte value, as fewer bytes are compared. It won't matter in most types of usage though, so it's a micro-optimization (but something you should know).
Bonus -- if you don't use short abbreviated field names, the size benefit of using an ObjectId natively as 12 bytes really won't matter, as the field names will far outweigh the size of bytes when stored as a string.
I'd recommend storing them as native ObjectIds. Some drivers can optionally and transparently translate an ObjectId to a String and back so that the client code can more easily manipulate it. The C# driver for example can do this, and I've used it so that when serializing to JSON, the ObjectId is in a simple format that is easily consumed in JavaScript.
This will matter most when you try to find the details of a product starting from the Categories collection.
Since there is no server-side JOIN in Mongo, your code will have to match documents together. ObjectIDs are encoded as 12 bytes, which you can easily compare in any language. Using either strings or object ids does not really matter.
The real issue you are facing is one of data normalization (or lack thereof). If you store the Name field in your Categories documents, instead of the ObjectID, you will be able to return the product names in a single call (instead of multiple calls, one for each product of the category).
It feels wrong the first time you do it. After all, you will have to update many documents if you ever change the name of a product, which might or might not be frequent. You have to model your data by thinking of the way your application will use it.
Finally, index the Name attribute in the Products collection. Getting the details of a product, starting with the string you found in a Categories document, will be fast.
Another way to do it is not to have a Categories collection at all, but to add a Category attribute to your Products documents. You can then find the documents that match {'Category':'Gardening'}. Indexing the Category field will probably be a good idea.
Again, ObjectID or String does not matter much. It is about modeling your data thinking of how your application will use it.

Confusion regarding Mongo db Schema. How to make it better?

I am using mongoose with node.js for this.
My current Schema is this:
var linkSchema = new Schema({
text: String,
tags: Array,
body: String,
user: String
})
My use-case is this: There is a list of users and each user has a list of links associated with it. Users and links are different Schemas of course. Thus, how does one get that sort of one to one relationship done using mongo-db?
Should I make a User Schema and embed linkSchema in it? Or the other way around?
Another doubt regarding that. Tags would always be an array of strings which I can use to browse through links later. Should it be an array data type or is there a better way to represent it?
If it's 1:1 then nest one document inside the other. Which way around depends on the queries, but you could easily do both if you need to.
For tags, you can index an array field and use that for searching/filtering documents and from the information you've given that sounds reasonable IMHO.
If you had a fixed set of tags it would make sense to represent those as a nested object with named fields perhaps, depending on queries. Don't forget you not only can create nested documents in Mongo but you can also search on sub-fields and even use entire nested documents as searchable/indexable fields. For instance, you could have a username like this;
email: "joe@somewhere.com"
as a string, and you could also do;
email: {
user: "joe",
domain: "somewhere.com"
}
You could index email in both cases and use either for matching. In the latter case though you could also search on domain or user only without resorting to RegEx style queries. You could also store both variants, so there are lots of flexible options in Mongo.
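As a small illustration of the nested variant, here is a sketch in plain Python (the helper name split_email is made up) that turns the string form into the user/domain sub-document before it is stored.

```python
def split_email(address):
    """Split 'user@domain' into the nested user/domain form."""
    # rpartition splits on the last '@', so the domain is everything after it.
    user, _, domain = address.rpartition("@")
    return {"user": user, "domain": domain}


doc = {"email": split_email("joe@somewhere.com")}
```

A filter such as {"email.domain": "somewhere.com"} would then match on the domain alone using dot notation, with no regular expression needed.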
Going back to tags, I think your array of strings is a fine model given what you've described, but if you were doing more complex bulk aggregation, it wouldn't be crazy to store a document for every tag with the same document contents, since that's essentially what you'd have to do for every query during aggregation.